Thread: Reviewing freeze map code
Hi,

The freeze map changes, besides being very important, seem to be one of the patches with a high risk profile in 9.6. Robert had asked whether I'd take a look. I thought it'd be a good idea to review that while running tests for http://www.postgresql.org/message-id/CAMkU=1w85Dqt766AUrCnyqCXfZ+rsk1witAc_=v5+Pce93Sftw@mail.gmail.com

For starters, I'm just going through the commits. It seems the relevant pieces are:

a892234 Change the format of the VM fork to add a second bit per page.
77a1d1e Department of second thoughts: remove PD_ALL_FROZEN.
fd31cd2 Don't vacuum all-frozen pages.
7087166 pg_upgrade: Convert old visibility map format to new format.
ba0a198 Add pg_visibility contrib module.

did I miss anything important?

Greetings,

Andres Freund
Hi,

some of the review items here are mere matters of style/preference. Feel entirely free to discard them, but I thought if I'm going through the change anyway...

On 2016-05-02 14:48:18 -0700, Andres Freund wrote:
> a892234 Change the format of the VM fork to add a second bit per page.

TL;DR: fairly minor stuff.

+ * heap_tuple_needs_eventual_freeze
+ *
+ * Check to see whether any of the XID fields of a tuple (xmin, xmax, xvac)
+ * will eventually require freezing. Similar to heap_tuple_needs_freeze,
+ * but there's no cutoff, since we're trying to figure out whether freezing
+ * will ever be needed, not whether it's needed now.
+ */
+bool
+heap_tuple_needs_eventual_freeze(HeapTupleHeader tuple)

Wouldn't redefining this to heap_tuple_is_frozen() and then inverting the checks be easier to understand?

+ /*
+  * If xmax is a valid xact or multixact, this tuple is also not frozen.
+  */
+ if (tuple->t_infomask & HEAP_XMAX_IS_MULTI)
+ {
+     MultiXactId multi;
+
+     multi = HeapTupleHeaderGetRawXmax(tuple);
+     if (MultiXactIdIsValid(multi))
+         return true;
+ }

Hm. What's the test inside the if() for? There shouldn't be any case where xmax is invalid if HEAP_XMAX_IS_MULTI is set. Now there's a check like that outside of this commit, but it seems strange to me (Alvaro, perhaps you could comment on this?).

+ *
+ * Clearing both visibility map bits is not separately WAL-logged. The callers
 * must make sure that whenever a bit is cleared, the bit is cleared on WAL
 * replay of the updating operation as well.

I think including "both" here makes things less clear, because it differentiates clearing one bit from clearing both. There's no practical difference atm, but still.

 *
 * VACUUM will normally skip pages for which the visibility map bit is set;
 * such pages can't contain any dead tuples and therefore don't need vacuuming.
- * The visibility map is not used for anti-wraparound vacuums, because
- * an anti-wraparound vacuum needs to freeze tuples and observe the latest xid
- * present in the table, even on pages that don't have any dead tuples.
 *

I think the remaining sentence isn't entirely accurate: there's now more than one bit, and they're different with regard to scan_all/!scan_all vacuums (or will be - maybe this is updated further in a later commit? But if so, that sentence shouldn't yet be removed...).

-
-/* Number of heap blocks we can represent in one byte. */
-#define HEAPBLOCKS_PER_BYTE 8
-

Hm, why was this moved to the header? Sounds like something the outside shouldn't care about.

#define HEAPBLK_TO_MAPBIT(x) (((x) % HEAPBLOCKS_PER_BYTE) * BITS_PER_HEAPBLOCK)

Hm. This isn't really a mapping to an individual bit anymore - but I don't really have a better name in mind. Maybe TO_OFFSET?

+static const uint8 number_of_ones_for_visible[256] = {
...
+};
+static const uint8 number_of_ones_for_frozen[256] = {
...
+};

Did somebody verify the new contents are correct?

 /*
- * visibilitymap_clear - clear a bit in visibility map
+ * visibilitymap_clear - clear all bits in visibility map
 *

This seems rather easy to misunderstand, as this really only clears all the bits for one page, not actually all the bits.

 * the bit for heapBlk, or InvalidBuffer. The caller is responsible for
- * releasing *buf after it's done testing and setting bits.
+ * releasing *buf after it's done testing and setting bits, and must pass flags
+ * for which it needs to check the value in visibility map.
 *
 * NOTE: This function is typically called without a lock on the heap page,
 * so somebody else could change the bit just after we look at it. In fact,
@@ -327,17 +351,16 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,

I'm not seeing what flags the above comment change is referring to?

 /*
- * A single-bit read is atomic. There could be memory-ordering effects
+ * A single byte read is atomic. There could be memory-ordering effects
 * here, but for performance reasons we make it the caller's job to worry
 * about that.
 */
- result = (map[mapByte] & (1 << mapBit)) ? true : false;
-
- return result;
+ return ((map[mapByte] >> mapBit) & VISIBILITYMAP_VALID_BITS);
}

Not a new issue, and *very* likely to be irrelevant in practice (given the value is only referenced once): But there's really no guarantee map[mapByte] is only read once here.

-BlockNumber
-visibilitymap_count(Relation rel)
+void
+visibilitymap_count(Relation rel, BlockNumber *all_visible, BlockNumber *all_frozen)

Not really a new issue again: The parameter types (previously return type) to this function seem wrong to me.

@@ -1934,5 +1992,14 @@ heap_page_is_all_visible(Relation rel, Buffer buf, TransactionId *visibility_cut
 }

+ /*
+  * We don't bother clearing *all_frozen when the page is discovered not
+  * to be all-visible, so do that now if necessary. The page might fail
+  * to be all-frozen for other reasons anyway, but if it's not all-visible,
+  * then it definitely isn't all-frozen.
+  */
+ if (!all_visible)
+     *all_frozen = false;
+

Why don't we just set *all_frozen to false when appropriate? It'd be just as many lines and probably easier to understand?

+ /*
+  * If the page is marked as all-visible but not all-frozen, we should
+  * so mark it. Note that all_frozen is only valid if all_visible is
+  * true, so we must check both.
+  */

This kinda seems to imply that all-visible implies all_frozen. Also, why has that block been added to the end of the if/else if chain? Seems like it belongs below the (all_visible && !all_visible_according_to_vm) block.

Greetings,

Andres Freund
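[For readers following the two-bit layout being reviewed above, here is a minimal, self-contained sketch - ours, not the committed code, and it collapses the per-VM-page addressing layer (HEAPBLK_TO_MAPBLOCK) that the real macros also have - showing why HEAPBLK_TO_MAPBIT now yields the offset of a heap block's two-bit group rather than a single bit position:

#include <stdio.h>
#include <stdint.h>

#define BITS_PER_HEAPBLOCK          2
#define HEAPBLOCKS_PER_BYTE         (8 / BITS_PER_HEAPBLOCK)    /* now 4, not 8 */

#define VISIBILITYMAP_ALL_VISIBLE   0x01
#define VISIBILITYMAP_ALL_FROZEN    0x02
#define VISIBILITYMAP_VALID_BITS    0x03

/* simplified: the real macros also divide the map into BLCKSZ-sized pages */
#define HEAPBLK_TO_MAPBYTE(x)       ((x) / HEAPBLOCKS_PER_BYTE)
#define HEAPBLK_TO_MAPBIT(x)        (((x) % HEAPBLOCKS_PER_BYTE) * BITS_PER_HEAPBLOCK)

int
main(void)
{
    uint8_t     map[2] = {0, 0};    /* a tiny toy visibility map */
    uint32_t    blk = 5;

    /* mark heap block 5 both all-visible and all-frozen */
    map[HEAPBLK_TO_MAPBYTE(blk)] |=
        VISIBILITYMAP_VALID_BITS << HEAPBLK_TO_MAPBIT(blk);

    /* extract both flag bits, as the reworked read path does */
    printf("block %u status = 0x%x\n", (unsigned) blk,
           (map[HEAPBLK_TO_MAPBYTE(blk)] >> HEAPBLK_TO_MAPBIT(blk))
           & VISIBILITYMAP_VALID_BITS);
    return 0;
}

With BITS_PER_HEAPBLOCK = 2 a map byte covers four heap blocks, the all-visible flags sitting at even bit positions and the all-frozen flags at odd ones - which is also why the number_of_ones_for_visible/frozen tables questioned above have to count the two kinds of set bits separately.]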
On Tue, May 3, 2016 at 6:48 AM, Andres Freund <andres@anarazel.de> wrote:
> Hi,
>
> The freeze map changes, besides being very important, seem to be one of
> the patches with a high risk profile in 9.6. Robert had asked whether
> I'd take a look. I thought it'd be a good idea to review that while
> running tests for
> http://www.postgresql.org/message-id/CAMkU=1w85Dqt766AUrCnyqCXfZ+rsk1witAc_=v5+Pce93Sftw@mail.gmail.com

Thank you for reviewing.

> For starters, I'm just going through the commits. It seems the relevant
> pieces are:
>
> a892234 Change the format of the VM fork to add a second bit per page.
> 77a1d1e Department of second thoughts: remove PD_ALL_FROZEN.
> fd31cd2 Don't vacuum all-frozen pages.
> 7087166 pg_upgrade: Convert old visibility map format to new format.
> ba0a198 Add pg_visibility contrib module.
>
> did I miss anything important?

That's all.

Regards,

--
Masahiko Sawada
On 2016-05-02 14:48:18 -0700, Andres Freund wrote:
> 77a1d1e Department of second thoughts: remove PD_ALL_FROZEN.

Nothing to say here.

> fd31cd2 Don't vacuum all-frozen pages.

Hm. I do wonder if it's going to bite us that we don't have a way to actually force vacuuming of the whole table (besides manually rm'ing the VM). I've more than once seen VACUUM used to try to do some integrity checking of the database. How are we actually going to test that the feature works correctly? They'd have to write checks on top of pg_visibility to see whether things are borked.

 /*
  * Compute whether we actually scanned the whole relation. If we did, we
  * can adjust relfrozenxid and relminmxid.
  *
  * NB: We need to check this before truncating the relation, because that
  * will change ->rel_pages.
  */

Comment is out-of-date now.

- if (blkno == next_not_all_visible_block)
+ if (blkno == next_unskippable_block)
  {
-     /* Time to advance next_not_all_visible_block */
-     for (next_not_all_visible_block++;
-          next_not_all_visible_block < nblocks;
-          next_not_all_visible_block++)
+     /* Time to advance next_unskippable_block */
+     for (next_unskippable_block++;
+          next_unskippable_block < nblocks;
+          next_unskippable_block++)

Hm. So we continue with the course of re-processing pages, even if they're marked all-frozen. For all-visible there at least can be a benefit by freezing earlier, but for all-frozen pages there's really no point. I don't really buy the arguments for the skipping logic. But even disregarding that, maybe we should skip processing a block if it's all-frozen (without preventing the page from being read?); as there's no possible benefit? Acquiring the exclusive/content lock and stuff is far from free.

Not really related to this patch, but the FORCE_CHECK_PAGE is rather ugly.

+ /*
+  * The current block is potentially skippable; if we've seen a
+  * long enough run of skippable blocks to justify skipping it, and
+  * we're not forced to check it, then go ahead and skip.
+  * Otherwise, the page must be at least all-visible if not
+  * all-frozen, so we can set all_visible_according_to_vm = true.
+  */
+ if (skipping_blocks && !FORCE_CHECK_PAGE())
+ {
+     /*
+      * Tricky, tricky. If this is in aggressive vacuum, the page
+      * must have been all-frozen at the time we checked whether it
+      * was skippable, but it might not be any more. We must be
+      * careful to count it as a skipped all-frozen page in that
+      * case, or else we'll think we can't update relfrozenxid and
+      * relminmxid. If it's not an aggressive vacuum, we don't
+      * know whether it was all-frozen, so we have to recheck; but
+      * in this case an approximate answer is OK.
+      */
+     if (aggressive || VM_ALL_FROZEN(onerel, blkno, &vmbuffer))
+         vacrelstats->frozenskipped_pages++;
      continue;
+ }

Hm. This indeed seems a bit tricky. Not sure how to make it easier though without just ripping out the SKIP_PAGES_THRESHOLD stuff.

Hm. This also doubles the number of VM accesses. While I guess that's not noticeable most of the time, it's still not nice; especially when a large relation is entirely frozen, because it'll mean we'll sequentially go through the visibility map twice.

I wondered for a minute whether #14057 could cause really bad issues here
http://www.postgresql.org/message-id/20160331103739.8956.94469@wrigleys.postgresql.org
but I don't see it being more relevant here.

Andres
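[To make the control flow under discussion easier to follow, here is a toy, self-contained simulation of the skip logic - ours, heavily simplified from lazy_scan_heap; SKIP_PAGES_THRESHOLD is defined as 32 in vacuumlazy.c. Runs of skippable pages shorter than the threshold are still read, and all-frozen pages skipped during an aggressive vacuum are remembered in frozenskipped_pages so relfrozenxid can still be advanced:

#include <stdbool.h>
#include <stdio.h>

#define SKIP_PAGES_THRESHOLD    32
#define NBLOCKS                 100

static bool vm_all_visible[NBLOCKS];    /* toy stand-in for VM_ALL_VISIBLE() */
static bool vm_all_frozen[NBLOCKS];     /* toy stand-in for VM_ALL_FROZEN() */

int
main(void)
{
    int     next_unskippable_block = 0;
    bool    skipping_blocks = false;
    bool    aggressive = false;         /* non-aggressive vacuum */
    int     frozenskipped_pages = 0;

    /* pretend blocks 10..79 are all-visible and all-frozen */
    for (int i = 10; i < 80; i++)
        vm_all_visible[i] = vm_all_frozen[i] = true;

    for (int blkno = 0; blkno < NBLOCKS; blkno++)
    {
        if (blkno == next_unskippable_block)
        {
            /* advance past the upcoming run of skippable blocks */
            for (next_unskippable_block++;
                 next_unskippable_block < NBLOCKS;
                 next_unskippable_block++)
            {
                if (aggressive
                    ? !vm_all_frozen[next_unskippable_block]
                    : !vm_all_visible[next_unskippable_block])
                    break;
            }
            /* only skip runs long enough to justify it */
            skipping_blocks =
                (next_unskippable_block - blkno) > SKIP_PAGES_THRESHOLD;
        }
        else if (skipping_blocks)
        {
            /* count definitely-frozen skips toward relfrozenxid safety */
            if (aggressive || vm_all_frozen[blkno])
                frozenskipped_pages++;
            continue;
        }
        /* ... a real vacuum would read and process the page here ... */
    }

    printf("skipped %d all-frozen pages\n", frozenskipped_pages);
    return 0;
}

Running it reports 70 skipped all-frozen pages: the 70-block frozen run clears the threshold, while the short runs at either end are still read, which is exactly the re-processing Andres is questioning.]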
Hi,

On 2016-05-02 14:48:18 -0700, Andres Freund wrote:
> 7087166 pg_upgrade: Convert old visibility map format to new format.

+const char *
+rewriteVisibilityMap(const char *fromfile, const char *tofile, bool force)
...
+ while ((bytesRead = read(src_fd, buffer, BLCKSZ)) == BLCKSZ)
+ {
..

Uh, shouldn't we actually fail if we read incompletely? Rather than silently ignoring the problem? Ok, this causes no corruption, but it indicates that something went significantly wrong.

+ char new_vmbuf[BLCKSZ];
+ char *new_cur = new_vmbuf;
+ bool empty = true;
+ bool old_lastpart;
+
+ /* Copy page header in advance */
+ memcpy(new_vmbuf, &pageheader, SizeOfPageHeaderData);

Shouldn't we zero out new_vmbuf? Afaics we're not necessarily zeroing it with old_lastpart && !empty, right?

+ if ((dst_fd = open(tofile, O_RDWR | O_CREAT | (force ? 0 : O_EXCL), S_IRUSR | S_IWUSR)) < 0)
+ {
+     close(src_fd);
+     return getErrorText();
+ }

I know you guys copied this, but what's the force thing about? Especially as it's always set to true by the callers (i.e. what is the parameter even about?)? Wouldn't we at least have to specify O_TRUNC in the force case?

+ old_cur += BITS_PER_HEAPBLOCK_OLD;
+ new_cur += BITS_PER_HEAPBLOCK;

I'm not sure I'm understanding the point of the BITS_PER_HEAPBLOCK_OLD stuff - as long as it's hardcoded into rewriteVisibilityMap() we'll not be able to have differing ones anyway, should we decide to add a third bit?

- Andres
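[The BITS_PER_HEAPBLOCK_OLD question is easier to see with the conversion written out. A hedged, self-contained sketch of the idea - ours; the helper name and structure differ from the actual rewriteVisibilityMap(): each old byte, eight blocks at one bit apiece, fans out into two new bytes of four two-bit slots, carrying over only the all-visible bit and leaving all-frozen clear, since the old format carries no frozen information:

#include <stdint.h>
#include <stdio.h>

#define BITS_PER_HEAPBLOCK_OLD      1   /* 9.5 and earlier: all-visible only */
#define BITS_PER_HEAPBLOCK          2   /* 9.6: all-visible + all-frozen */
#define VISIBILITYMAP_ALL_VISIBLE   0x01

/* Expand one old-format VM byte (8 heap blocks) into two new-format
 * bytes (4 heap blocks each). The new all-frozen bits start clear. */
static void
expand_vm_byte(uint8_t old_byte, uint8_t out[2])
{
    out[0] = out[1] = 0;
    for (int blk = 0; blk < 8; blk++)
    {
        if (old_byte & (1 << (blk * BITS_PER_HEAPBLOCK_OLD)))
            out[blk / 4] |= VISIBILITYMAP_ALL_VISIBLE
                << ((blk % 4) * BITS_PER_HEAPBLOCK);
    }
}

int
main(void)
{
    uint8_t out[2];

    expand_vm_byte(0xFF, out);          /* all eight blocks all-visible */
    printf("0xFF -> 0x%02X 0x%02X\n", out[0], out[1]);  /* 0x55 0x55 */
    return 0;
}

Written this way, Andres's point stands out: the expansion ratio is baked into the loop shape, so the _OLD constant alone wouldn't let a third bit be added later.]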
On Mon, May 2, 2016 at 8:25 PM, Andres Freund <andres@anarazel.de> wrote:
> + * heap_tuple_needs_eventual_freeze
> + *
> + * Check to see whether any of the XID fields of a tuple (xmin, xmax, xvac)
> + * will eventually require freezing. Similar to heap_tuple_needs_freeze,
> + * but there's no cutoff, since we're trying to figure out whether freezing
> + * will ever be needed, not whether it's needed now.
> + */
> +bool
> +heap_tuple_needs_eventual_freeze(HeapTupleHeader tuple)
>
> Wouldn't redefining this to heap_tuple_is_frozen() and then inverting the
> checks be easier to understand?

I thought it much safer to keep this as close to a copy of heap_tuple_needs_freeze() as possible. Copying a function and inverting all of the return values is much more likely to introduce bugs, IME.

> + /*
> +  * If xmax is a valid xact or multixact, this tuple is also not frozen.
> +  */
> + if (tuple->t_infomask & HEAP_XMAX_IS_MULTI)
> + {
> +     MultiXactId multi;
> +
> +     multi = HeapTupleHeaderGetRawXmax(tuple);
> +     if (MultiXactIdIsValid(multi))
> +         return true;
> + }
>
> Hm. What's the test inside the if() for? There shouldn't be any case
> where xmax is invalid if HEAP_XMAX_IS_MULTI is set. Now there's a
> check like that outside of this commit, but it seems strange to me
> (Alvaro, perhaps you could comment on this?).

Here again I was copying existing code, with appropriate simplifications.

> + *
> + * Clearing both visibility map bits is not separately WAL-logged. The callers
> * must make sure that whenever a bit is cleared, the bit is cleared on WAL
> * replay of the updating operation as well.
>
> I think including "both" here makes things less clear, because it
> differentiates clearing one bit from clearing both. There's no practical
> difference atm, but still.

I agree.

> *
> * VACUUM will normally skip pages for which the visibility map bit is set;
> * such pages can't contain any dead tuples and therefore don't need vacuuming.
> - * The visibility map is not used for anti-wraparound vacuums, because
> - * an anti-wraparound vacuum needs to freeze tuples and observe the latest xid
> - * present in the table, even on pages that don't have any dead tuples.
> *
>
> I think the remaining sentence isn't entirely accurate, there's now more
> than one bit, and they're different with regard to scan_all/!scan_all
> vacuums (or will be - maybe this is updated further in a later commit? But
> if so, that sentence shouldn't yet be removed...).

We can adjust the language, but I don't really see a big problem here.

> -/* Number of heap blocks we can represent in one byte. */
> -#define HEAPBLOCKS_PER_BYTE 8
> -
> Hm, why was this moved to the header? Sounds like something the outside
> shouldn't care about.

Oh... yeah. Let's undo that.

> #define HEAPBLK_TO_MAPBIT(x) (((x) % HEAPBLOCKS_PER_BYTE) * BITS_PER_HEAPBLOCK)
>
> Hm. This isn't really a mapping to an individual bit anymore - but I
> don't really have a better name in mind. Maybe TO_OFFSET?

Well, it sorta is... but we could change it, I suppose.

> +static const uint8 number_of_ones_for_visible[256] = {
> ...
> +};
> +static const uint8 number_of_ones_for_frozen[256] = {
> ...
> +};
>
> Did somebody verify the new contents are correct?

I admit that I didn't. It seemed like an unlikely place for a goof, but I guess we should verify.
> /*
> - * visibilitymap_clear - clear a bit in visibility map
> + * visibilitymap_clear - clear all bits in visibility map
> *
>
> This seems rather easy to misunderstand, as this really only clears all
> the bits for one page, not actually all the bits.

We could change "in" to "for one page in the".

> * the bit for heapBlk, or InvalidBuffer. The caller is responsible for
> - * releasing *buf after it's done testing and setting bits.
> + * releasing *buf after it's done testing and setting bits, and must pass flags
> + * for which it needs to check the value in visibility map.
> *
> * NOTE: This function is typically called without a lock on the heap page,
> * so somebody else could change the bit just after we look at it. In fact,
> @@ -327,17 +351,16 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
>
> I'm not seeing what flags the above comment change is referring to?

Ugh. I think that's leftover cruft from an earlier patch version that should have been excised from what got committed.

> /*
> - * A single-bit read is atomic. There could be memory-ordering effects
> + * A single byte read is atomic. There could be memory-ordering effects
> * here, but for performance reasons we make it the caller's job to worry
> * about that.
> */
> - result = (map[mapByte] & (1 << mapBit)) ? true : false;
> -
> - return result;
> + return ((map[mapByte] >> mapBit) & VISIBILITYMAP_VALID_BITS);
> }
>
> Not a new issue, and *very* likely to be irrelevant in practice (given
> the value is only referenced once): But there's really no guarantee
> map[mapByte] is only read once here.

Meh. But we can fix if you want to.

> -BlockNumber
> -visibilitymap_count(Relation rel)
> +void
> +visibilitymap_count(Relation rel, BlockNumber *all_visible, BlockNumber *all_frozen)
>
> Not really a new issue again: The parameter types (previously return
> type) to this function seem wrong to me.

Not this patch's job to tinker.

> @@ -1934,5 +1992,14 @@ heap_page_is_all_visible(Relation rel, Buffer buf, TransactionId *visibility_cut
> }
> + /*
> +  * We don't bother clearing *all_frozen when the page is discovered not
> +  * to be all-visible, so do that now if necessary. The page might fail
> +  * to be all-frozen for other reasons anyway, but if it's not all-visible,
> +  * then it definitely isn't all-frozen.
> +  */
> + if (!all_visible)
> +     *all_frozen = false;
> +
>
> Why don't we just set *all_frozen to false when appropriate? It'd be
> just as many lines and probably easier to understand?

I thought that looked really easy to mess up, either now or down the road. This way seemed more solid to me. That's a judgement call, of course.

> + /*
> +  * If the page is marked as all-visible but not all-frozen, we should
> +  * so mark it. Note that all_frozen is only valid if all_visible is
> +  * true, so we must check both.
> +  */
>
> This kinda seems to imply that all-visible implies all_frozen. Also, why
> has that block been added to the end of the if/else if chain? Seems like
> it belongs below the (all_visible && !all_visible_according_to_vm) block.

We can adjust the comment a bit to make it more clear, if you like, but I doubt it's going to cause serious misunderstanding. As for the placement, the reason I put it at the end is because I figured that we did not want to mark it all-frozen if any of the "oh crap, emit a warning" cases applied.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
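[For what the "read once" fix would look like, a minimal sketch - ours, illustrative rather than the committed function - that copies the shared byte into a local before extracting the flag bits:

#include <stdint.h>

#define VISIBILITYMAP_VALID_BITS    0x03

/* Read a heap block's two VM flag bits with exactly one load of the
 * shared map byte; the local copy keeps the compiler from reloading
 * map[mapByte] between uses. */
static uint8_t
vm_get_status(const volatile uint8_t *map, uint32_t mapByte, int mapBit)
{
    uint8_t mapbits = map[mapByte];     /* the single shared-memory read */

    return (mapbits >> mapBit) & VISIBILITYMAP_VALID_BITS;
}

int
main(void)
{
    static const uint8_t map[1] = {0x0C};   /* block 1: all-visible + all-frozen */

    return vm_get_status(map, 0, 2) == VISIBILITYMAP_VALID_BITS ? 0 : 1;
}
]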
On Wed, May 4, 2016 at 8:08 PM, Andres Freund <andres@anarazel.de> wrote:
> On 2016-05-02 14:48:18 -0700, Andres Freund wrote:
>> 77a1d1e Department of second thoughts: remove PD_ALL_FROZEN.
>
> Nothing to say here.
>
>> fd31cd2 Don't vacuum all-frozen pages.
>
> Hm. I do wonder if it's going to bite us that we don't have a way to
> actually force vacuuming of the whole table (besides manually rm'ing the
> VM). I've more than once seen VACUUM used to try to do some integrity
> checking of the database. How are we actually going to test that the
> feature works correctly? They'd have to write checks on top of
> pg_visibility to see whether things are borked.

Let's add VACUUM (FORCE) or something like that.

> /*
>  * Compute whether we actually scanned the whole relation. If we did, we
>  * can adjust relfrozenxid and relminmxid.
>  *
>  * NB: We need to check this before truncating the relation, because that
>  * will change ->rel_pages.
>  */
>
> Comment is out-of-date now.

OK.

> - if (blkno == next_not_all_visible_block)
> + if (blkno == next_unskippable_block)
>   {
> -     /* Time to advance next_not_all_visible_block */
> -     for (next_not_all_visible_block++;
> -          next_not_all_visible_block < nblocks;
> -          next_not_all_visible_block++)
> +     /* Time to advance next_unskippable_block */
> +     for (next_unskippable_block++;
> +          next_unskippable_block < nblocks;
> +          next_unskippable_block++)
>
> Hm. So we continue with the course of re-processing pages, even if
> they're marked all-frozen. For all-visible there at least can be a
> benefit by freezing earlier, but for all-frozen pages there's really no
> point. I don't really buy the arguments for the skipping logic. But
> even disregarding that, maybe we should skip processing a block if it's
> all-frozen (without preventing the page from being read?); as there's no
> possible benefit? Acquiring the exclusive/content lock and stuff is far
> from free.

I wanted to tinker with this logic as little as possible in the interest of ending up with something that worked. I would not have written it this way.

> Not really related to this patch, but the FORCE_CHECK_PAGE is rather
> ugly.

+1.

> + /*
> +  * The current block is potentially skippable; if we've seen a
> +  * long enough run of skippable blocks to justify skipping it, and
> +  * we're not forced to check it, then go ahead and skip.
> +  * Otherwise, the page must be at least all-visible if not
> +  * all-frozen, so we can set all_visible_according_to_vm = true.
> +  */
> + if (skipping_blocks && !FORCE_CHECK_PAGE())
> + {
> +     /*
> +      * Tricky, tricky. If this is in aggressive vacuum, the page
> +      * must have been all-frozen at the time we checked whether it
> +      * was skippable, but it might not be any more. We must be
> +      * careful to count it as a skipped all-frozen page in that
> +      * case, or else we'll think we can't update relfrozenxid and
> +      * relminmxid. If it's not an aggressive vacuum, we don't
> +      * know whether it was all-frozen, so we have to recheck; but
> +      * in this case an approximate answer is OK.
> +      */
> +     if (aggressive || VM_ALL_FROZEN(onerel, blkno, &vmbuffer))
> +         vacrelstats->frozenskipped_pages++;
>       continue;
> + }
>
> Hm. This indeed seems a bit tricky. Not sure how to make it easier
> though without just ripping out the SKIP_PAGES_THRESHOLD stuff.

Yep, I had the same problem.

> Hm. This also doubles the number of VM accesses. While I guess that's
> not noticeable most of the time, it's still not nice; especially when a
> large relation is entirely frozen, because it'll mean we'll sequentially
> go through the visibility map twice.

Compared to what we're saving, that's obviously a trivial cost. That's not to say that we might not want to improve it, but it's hardly a disaster. In short: wah, wah, wah.

> I wondered for a minute whether #14057 could cause really bad issues
> here
> http://www.postgresql.org/message-id/20160331103739.8956.94469@wrigleys.postgresql.org
> but I don't see it being more relevant here.

I don't really understand what the concern is here, but if it's not a problem, let's not spend time trying to clarify.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Thu, May 5, 2016 at 2:20 PM, Andres Freund <andres@anarazel.de> wrote:
> On 2016-05-02 14:48:18 -0700, Andres Freund wrote:
>> 7087166 pg_upgrade: Convert old visibility map format to new format.
>
> +const char *
> +rewriteVisibilityMap(const char *fromfile, const char *tofile, bool force)
> ...
> + while ((bytesRead = read(src_fd, buffer, BLCKSZ)) == BLCKSZ)
> + {
> ..
>
> Uh, shouldn't we actually fail if we read incompletely? Rather than
> silently ignoring the problem? Ok, this causes no corruption, but it
> indicates that something went significantly wrong.

Sure, that's reasonable.

> + char new_vmbuf[BLCKSZ];
> + char *new_cur = new_vmbuf;
> + bool empty = true;
> + bool old_lastpart;
> +
> + /* Copy page header in advance */
> + memcpy(new_vmbuf, &pageheader, SizeOfPageHeaderData);
>
> Shouldn't we zero out new_vmbuf? Afaics we're not necessarily zeroing it
> with old_lastpart && !empty, right?

Oh, dear. That seems like a possible data corruption bug. Maybe we'd better fix that right away (although I don't actually have time before the wrap).

> + if ((dst_fd = open(tofile, O_RDWR | O_CREAT | (force ? 0 : O_EXCL), S_IRUSR | S_IWUSR)) < 0)
> + {
> +     close(src_fd);
> +     return getErrorText();
> + }
>
> I know you guys copied this, but what's the force thing about?
> Especially as it's always set to true by the callers (i.e. what is the
> parameter even about?)? Wouldn't we at least have to specify O_TRUNC in
> the force case?

I just work here.

> + old_cur += BITS_PER_HEAPBLOCK_OLD;
> + new_cur += BITS_PER_HEAPBLOCK;
>
> I'm not sure I'm understanding the point of the BITS_PER_HEAPBLOCK_OLD
> stuff - as long as it's hardcoded into rewriteVisibilityMap() we'll not
> be able to have differing ones anyway, should we decide to add a third
> bit?

I think that's just a matter of style.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
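[The zeroing fix being contemplated is small. A hedged sketch - ours; the type and constant stand-ins below are illustrative, not pg_upgrade's actual declarations - of starting each new VM page from all zeroes before copying the header in:

#include <string.h>

#define BLCKSZ 8192

/* illustrative stand-in for the real page header machinery */
typedef struct { char data[24]; } PageHeaderData;
#define SizeOfPageHeaderData sizeof(PageHeaderData)

static void
init_new_vm_page(char *new_vmbuf, const PageHeaderData *pageheader)
{
    /*
     * Zero the whole buffer first, so the unused tail of a partially
     * filled last VM page cannot carry stack garbage into the new
     * cluster; then copy the page header in advance, as before.
     */
    memset(new_vmbuf, 0, BLCKSZ);
    memcpy(new_vmbuf, pageheader, SizeOfPageHeaderData);
}

int
main(void)
{
    static PageHeaderData   hdr;
    char                    new_vmbuf[BLCKSZ];

    init_new_vm_page(new_vmbuf, &hdr);
    return 0;
}
]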
On 05/06/2016 01:40 PM, Robert Haas wrote:
> On Wed, May 4, 2016 at 8:08 PM, Andres Freund <andres@anarazel.de> wrote:
>> On 2016-05-02 14:48:18 -0700, Andres Freund wrote:
>>> 77a1d1e Department of second thoughts: remove PD_ALL_FROZEN.
>>
>> Nothing to say here.
>>
>>> fd31cd2 Don't vacuum all-frozen pages.
>>
>> Hm. I do wonder if it's going to bite us that we don't have a way to
>> actually force vacuuming of the whole table (besides manually rm'ing the
>> VM). I've more than once seen VACUUM used to try to do some integrity
>> checking of the database. How are we actually going to test that the
>> feature works correctly? They'd have to write checks on top of
>> pg_visibility to see whether things are borked.
>
> Let's add VACUUM (FORCE) or something like that.

This is actually inverted. Vacuum by default should vacuum the entire relation; however, if we are going to keep the existing behavior of this patch, VACUUM (FROZEN) seems to be better than (FORCE)?

Sincerely,

JD

--
Command Prompt, Inc. http://the.postgres.company/ +1-503-667-4564
PostgreSQL Centered full stack support, consulting and development.
Everyone appreciates your honesty, until you are honest with them.
On 2016-05-06 13:48:09 -0700, Joshua D. Drake wrote:
> On 05/06/2016 01:40 PM, Robert Haas wrote:
> > On Wed, May 4, 2016 at 8:08 PM, Andres Freund <andres@anarazel.de> wrote:
> > > On 2016-05-02 14:48:18 -0700, Andres Freund wrote:
> > > > 77a1d1e Department of second thoughts: remove PD_ALL_FROZEN.
> > >
> > > Nothing to say here.
> > >
> > > > fd31cd2 Don't vacuum all-frozen pages.
> > >
> > > Hm. I do wonder if it's going to bite us that we don't have a way to
> > > actually force vacuuming of the whole table (besides manually rm'ing the
> > > VM). I've more than once seen VACUUM used to try to do some integrity
> > > checking of the database. How are we actually going to test that the
> > > feature works correctly? They'd have to write checks on top of
> > > pg_visibility to see whether things are borked.
> >
> > Let's add VACUUM (FORCE) or something like that.

Yes, that makes sense.

> This is actually inverted. Vacuum by default should vacuum the entire
> relation

What? Why on earth would that be a good idea? Not to speak of the fact that that's not been the case since ~8.4?

> ; however, if we are going to keep the existing behavior of this
> patch, VACUUM (FROZEN) seems to be better than (FORCE)?

There already is FREEZE - meaning something different - so I doubt it.

Andres
On 05/06/2016 01:50 PM, Andres Freund wrote: >>> Let's add VACUUM (FORCE) or something like that. > > Yes, that makes sense. > > >> This is actually inverted. Vacuum by default should vacuum the entire >> relation > > What? Why on earth would that be a good idea? Not to speak of hte fact > that that's not been the case since ~8.4? Sorry, I just meant the default behavior shouldn't change but I do agree that we need the ability to keep the same behavior. >> ,however if we are going to keep the existing behavior of this >> patch, VACUUM (FROZEN) seems to be better than (FORCE)? > > There already is FREEZE - meaning something different - so I doubt it. Yeah I thought about that, it is the word "FORCE" that bothers me. When you use FORCE there is an assumption that no matter what, it plows through (think rm -f). So if we don't use FROZEN, that's cool but FORCE doesn't work either. Sincerely, JD -- Command Prompt, Inc. http://the.postgres.company/ +1-503-667-4564 PostgreSQL Centered full stack support, consulting and development. Everyone appreciates your honesty, until you are honest with them.
* Joshua D. Drake (jd@commandprompt.com) wrote: > Yeah I thought about that, it is the word "FORCE" that bothers me. > When you use FORCE there is an assumption that no matter what, it > plows through (think rm -f). So if we don't use FROZEN, that's cool > but FORCE doesn't work either. Isn't that exactly what this FORCE option being contemplated would do though? Plow through the entire relation, regardless of what the VM says is all frozen or not? Seems like FORCE is a good word for that to me. Thanks! Stephen
On 2016-05-06 13:54:09 -0700, Joshua D. Drake wrote:
> On 05/06/2016 01:50 PM, Andres Freund wrote:
> > > > Let's add VACUUM (FORCE) or something like that.
> >
> > Yes, that makes sense.
> >
> > > This is actually inverted. Vacuum by default should vacuum the entire
> > > relation
> >
> > What? Why on earth would that be a good idea? Not to speak of the fact
> > that that's not been the case since ~8.4?
>
> Sorry, I just meant the default behavior shouldn't change but I do agree
> that we need the ability to keep the same behavior.

Which default behaviour shouldn't change? The one in master where we skip known frozen pages? Or the released branches where we can't skip those?

> > > ; however, if we are going to keep the existing behavior of this
> > > patch, VACUUM (FROZEN) seems to be better than (FORCE)?
> >
> > There already is FREEZE - meaning something different - so I doubt it.
>
> Yeah I thought about that, it is the word "FORCE" that bothers me. When you
> use FORCE there is an assumption that no matter what, it plows through
> (think rm -f). So if we don't use FROZEN, that's cool but FORCE doesn't work
> either.

SCANALL?
On 05/06/2016 01:58 PM, Stephen Frost wrote:
> * Joshua D. Drake (jd@commandprompt.com) wrote:
>> Yeah I thought about that, it is the word "FORCE" that bothers me.
>> When you use FORCE there is an assumption that no matter what, it
>> plows through (think rm -f). So if we don't use FROZEN, that's cool
>> but FORCE doesn't work either.
>
> Isn't that exactly what this FORCE option being contemplated would do
> though? Plow through the entire relation, regardless of what the VM
> says is all frozen or not?
>
> Seems like FORCE is a good word for that to me.

Except that we aren't FORCING a vacuum. That is the part I have contention with. To me, FORCE means: No matter what else is happening, we are vacuuming this relation (think locks).

But I am also not going to dig in my heels. If that is truly what -hackers come up with, thank you at least for considering what I said.

Sincerely,

JD

> Thanks!
>
> Stephen

--
Command Prompt, Inc. http://the.postgres.company/ +1-503-667-4564
PostgreSQL Centered full stack support, consulting and development.
Everyone appreciates your honesty, until you are honest with them.
On 05/06/2016 01:58 PM, Andres Freund wrote: > On 2016-05-06 13:54:09 -0700, Joshua D. Drake wrote: >> On 05/06/2016 01:50 PM, Andres Freund wrote: >>> There already is FREEZE - meaning something different - so I doubt it. >> >> Yeah I thought about that, it is the word "FORCE" that bothers me. When you >> use FORCE there is an assumption that no matter what, it plows through >> (think rm -f). So if we don't use FROZEN, that's cool but FORCE doesn't work >> either. > > SCANALL? > VACUUM THEWHOLEDAMNTHING -- -- Josh Berkus Red Hat OSAS (any opinions are my own)
On 05/06/2016 02:01 PM, Josh berkus wrote: > On 05/06/2016 01:58 PM, Andres Freund wrote: >> On 2016-05-06 13:54:09 -0700, Joshua D. Drake wrote: >>> On 05/06/2016 01:50 PM, Andres Freund wrote: > >>>> There already is FREEZE - meaning something different - so I doubt it. >>> >>> Yeah I thought about that, it is the word "FORCE" that bothers me. When you >>> use FORCE there is an assumption that no matter what, it plows through >>> (think rm -f). So if we don't use FROZEN, that's cool but FORCE doesn't work >>> either. >> >> SCANALL? >> > > VACUUM THEWHOLEDAMNTHING > I know that would never fly but damn if that wouldn't be an awesome keyword for VACUUM. JD -- Command Prompt, Inc. http://the.postgres.company/ +1-503-667-4564 PostgreSQL Centered full stack support, consulting and development. Everyone appreciates your honesty, until you are honest with them.
* Josh berkus (josh@agliodbs.com) wrote: > On 05/06/2016 01:58 PM, Andres Freund wrote: > > On 2016-05-06 13:54:09 -0700, Joshua D. Drake wrote: > >> On 05/06/2016 01:50 PM, Andres Freund wrote: > > >>> There already is FREEZE - meaning something different - so I doubt it. > >> > >> Yeah I thought about that, it is the word "FORCE" that bothers me. When you > >> use FORCE there is an assumption that no matter what, it plows through > >> (think rm -f). So if we don't use FROZEN, that's cool but FORCE doesn't work > >> either. > > > > SCANALL? > > > > VACUUM THEWHOLEDAMNTHING +100 (hahahaha) Thanks! Stephen
On 2016-05-06 14:03:11 -0700, Joshua D. Drake wrote:
> On 05/06/2016 02:01 PM, Josh berkus wrote:
> > On 05/06/2016 01:58 PM, Andres Freund wrote:
> > > On 2016-05-06 13:54:09 -0700, Joshua D. Drake wrote:
> > > > On 05/06/2016 01:50 PM, Andres Freund wrote:
> > > > > There already is FREEZE - meaning something different - so I doubt it.
> > > >
> > > > Yeah I thought about that, it is the word "FORCE" that bothers me. When you
> > > > use FORCE there is an assumption that no matter what, it plows through
> > > > (think rm -f). So if we don't use FROZEN, that's cool but FORCE doesn't work
> > > > either.
> > >
> > > SCANALL?
> >
> > VACUUM THEWHOLEDAMNTHING
>
> I know that would never fly but damn if that wouldn't be an awesome keyword
> for VACUUM.

It bothers me more than it probably should: Nobody tests, reviews, whatever a complex patch with significant data-loss potential. But as soon as somebody dares to mention an option name...
On 05/06/2016 02:03 PM, Stephen Frost wrote:
>>
>> VACUUM THEWHOLEDAMNTHING
>
> +100
>
> (hahahaha)

You know what? Why not? Seriously? We aren't a product. This is supposed to be a bit fun. Let's have some fun with it? It would be so easy to turn that into a positive advocacy opportunity.

JD

--
Command Prompt, Inc. http://the.postgres.company/ +1-503-667-4564
PostgreSQL Centered full stack support, consulting and development.
Everyone appreciates your honesty, until you are honest with them.
On 05/06/2016 02:08 PM, Andres Freund wrote:
> It bothers me more than it probably should: Nobody tests, reviews,
> whatever a complex patch with significant data-loss potential. But as
> soon as somebody dares to mention an option name...

Definitely more than it should, because it's gonna happen *every* time.

https://en.wikipedia.org/wiki/Law_of_triviality

--
Josh Berkus
Red Hat OSAS
(any opinions are my own)
On 2016-05-06 14:10:04 -0700, Josh berkus wrote:
> On 05/06/2016 02:08 PM, Andres Freund wrote:
>
> > It bothers me more than it probably should: Nobody tests, reviews,
> > whatever a complex patch with significant data-loss potential. But as
> > soon as somebody dares to mention an option name...
>
> Definitely more than it should, because it's gonna happen *every* time.
>
> https://en.wikipedia.org/wiki/Law_of_triviality

Doesn't mean it should not be frowned upon.
On 05/06/2016 02:12 PM, Andres Freund wrote:
> On 2016-05-06 14:10:04 -0700, Josh berkus wrote:
>> On 05/06/2016 02:08 PM, Andres Freund wrote:
>>
>>> It bothers me more than it probably should: Nobody tests, reviews,
>>> whatever a complex patch with significant data-loss potential. But as
>>> soon as somebody dares to mention an option name...
>>
>> Definitely more than it should, because it's gonna happen *every* time.
>>
>> https://en.wikipedia.org/wiki/Law_of_triviality
>
> Doesn't mean it should not be frowned upon.

Or made light of, hence my post. Personally I don't care what the option is called, as long as we have docs for it.

For the serious testing, does anyone have a good technique for creating loads which would stress-test vacuum freezing? It's hard for me to come up with anything which wouldn't be very time-and-resource intensive (like running at 10,000 TPS for a week).

--
Josh Berkus
Red Hat OSAS
(any opinions are my own)
On 05/06/2016 02:08 PM, Andres Freund wrote:
>>> VACUUM THEWHOLEDAMNTHING
>>
>> I know that would never fly but damn if that wouldn't be an awesome keyword
>> for VACUUM.
>
> It bothers me more than it probably should: Nobody tests, reviews,
> whatever a complex patch with significant data-loss potential. But as
> soon as somebody dares to mention an option name...

That is a fair complaint but let me ask you something: How do I test? Is there a script I can run? Are there specific things I can do to try and break it? What are we looking for exactly?

A lot of -hackers seem to forget that although we have 100 -hackers, we have 10000 "consultant/practitioners". Could I read the code and, with a weekend of WTF and -hackers questions, figure out what is going on? Yes, but a lot of people couldn't and I don't have the time.

You want me (or people like me) to test more? Give us an easy way to do it. Otherwise, we do what we can, which is try and interface on the things that will directly and immediately affect us (like keywords and syntax).

Sincerely,

JD

--
Command Prompt, Inc. http://the.postgres.company/ +1-503-667-4564
PostgreSQL Centered full stack support, consulting and development.
Everyone appreciates your honesty, until you are honest with them.
On 2016-05-06 14:15:47 -0700, Josh berkus wrote:
> For the serious testing, does anyone have a good technique for creating
> loads which would stress-test vacuum freezing? It's hard for me to come
> up with anything which wouldn't be very time-and-resource intensive
> (like running at 10,000 TPS for a week).

I've changed the limits for freezing options a while back, so you can now set autovacuum_freeze_max_age as low as 100000 (best set vacuum_freeze_table_age accordingly). You'll have to come up with a workload that doesn't overwrite all data continuously (otherwise there'll never be old rows), but otherwise it should now be fairly easy to test that kind of scenario.

Andres
Hi,

On 2016-05-06 14:17:13 -0700, Joshua D. Drake wrote:
> How do I test?
>
> Is there a script I can run?

Unfortunately there are few interesting things to test with pre-made scripts. There's no relevant OS dependency here, so each already existing test doesn't really lead to significantly increased coverage being run by other people. Generally, when testing for correctness issues, it's often of limited benefit to run tests written by the author or reviewer - such scripts will usually just test things either one has thought of. The dangerous areas are the ones neither author nor reviewer has considered.

> Are there specific things I can do to try and break it?

Upgrade clusters using pg_upgrade and make sure things like index only scans still work and yield correct data. Set up workloads that involve freezing, and check that less WAL (and not more!) is generated with 9.6 than with 9.5. Make sure queries still work.

> What are we looking for exactly?

Data corruption, efficiency problems.

> A lot of -hackers seem to forget that although we have 100 -hackers, we have
> 10000 "consultant/practitioners". Could I read the code and, with a weekend
> of WTF and -hackers questions, figure out what is going on? Yes, but a lot of
> people couldn't and I don't have the time.

I think tests without reading the code are quite sensible and important. And it perfectly makes sense to ask for information about what to test. But fundamentally testing is a lot of work, as is writing and reviewing code; unless you're really really good at destructive testing, you won't find much in a 15 minute break.

> You want me (or people like me) to test more? Give us an easy way to
> do it.

Useful additional testing and easy just don't go well together. By the time I have made it easy I've done the testing that's needed.

> Otherwise, we do what we can, which is try and interface on the things that
> will directly and immediately affect us (like keywords and syntax).

The amount of bikeshedding on -hackers steals energy and time for actually working on stuff, including testing. So I have little sympathy for the amount of bike shedding done.

Greetings,

Andres Freund
Joshua D. Drake wrote:
> On 05/06/2016 01:40 PM, Robert Haas wrote:
> > On Wed, May 4, 2016 at 8:08 PM, Andres Freund <andres@anarazel.de> wrote:
> > > On 2016-05-02 14:48:18 -0700, Andres Freund wrote:
> > > > 77a1d1e Department of second thoughts: remove PD_ALL_FROZEN.
> > >
> > > Nothing to say here.
> > >
> > > > fd31cd2 Don't vacuum all-frozen pages.
> > >
> > > Hm. I do wonder if it's going to bite us that we don't have a way to
> > > actually force vacuuming of the whole table (besides manually rm'ing the
> > > VM). I've more than once seen VACUUM used to try to do some integrity
> > > checking of the database. How are we actually going to test that the
> > > feature works correctly? They'd have to write checks on top of
> > > pg_visibility to see whether things are borked.
> >
> > Let's add VACUUM (FORCE) or something like that.
>
> This is actually inverted. Vacuum by default should vacuum the entire
> relation; however, if we are going to keep the existing behavior of this
> patch, VACUUM (FROZEN) seems to be better than (FORCE)?

Prior to some 7.x release, VACUUM actually did what we ripped out in the 9.0 release as VACUUM FULL. We actually changed the mode of operation quite heavily into the "lazy" mode which didn't acquire access exclusive lock, and it was a huge relief. I think that changing the mode of operation to be the lightest possible thing that gets the job done is convenient for users, because their existing scripts continue to clean their tables, only they take less time. No need to tweak the maintenance scripts.

I don't know what happens when the freeze_table_age threshold is reached. Do we scan the whole table when that happens? Because if we do, then we don't need a new keyword: just invoke the command after lowering the setting.

Another question on this feature is what happens with the table age (relfrozenxid, relminmxid) when the table is not wholly scanned by vacuum.

--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Andres Freund wrote: > On 2016-05-06 14:17:13 -0700, Joshua D. Drake wrote: > > How do I test? > > > > Is there a script I can run? > > Unfortunately there's few interesting things to test with pre-made > scripts. There's no relevant OS dependency here, so each already > existing test doesn't really lead to significantly increased coverage > being run by other people. Generally, when testing for correctness > issues, it's often of limited benefit to run tests written by the author > of reviewer - such scripts will usually just test things either has > thought of. The dangerous areas are the ones neither author or reviewer > has considered. We touched this question in connection with multixact freezing and wraparound. Testers seem to want to be given a script that they can install and run, then go for a beer and get back to a bunch of errors to report. But it doesn't work that way; writing a useful test script requires a lot of effort. Jeff Janes has done astounding work in these matters. (I don't think we credit him enough for that.) -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 05/06/2016 02:29 PM, Andres Freund wrote:
> Hi,
>
> On 2016-05-06 14:17:13 -0700, Joshua D. Drake wrote:
>> How do I test?
>>
>> Is there a script I can run?
>
> Unfortunately there are few interesting things to test with pre-made
> scripts. There's no relevant OS dependency here, so each already
> existing test doesn't really lead to significantly increased coverage
> being run by other people. Generally, when testing for correctness
> issues, it's often of limited benefit to run tests written by the author
> or reviewer - such scripts will usually just test things either one has
> thought of. The dangerous areas are the ones neither author nor reviewer
> has considered.

I can't argue with that.

>> Are there specific things I can do to try and break it?
>
> Upgrade clusters using pg_upgrade and make sure things like index only
> scans still work and yield correct data. Set up workloads that involve
> freezing, and check that less WAL (and not more!) is generated with 9.6
> than with 9.5. Make sure queries still work.
>
>> What are we looking for exactly?
>
> Data corruption, efficiency problems.

I am really not trying to be difficult here but Data Corruption is an easy one... what is the metric we accept as an efficiency problem?

>> A lot of -hackers seem to forget that although we have 100 -hackers, we have
>> 10000 "consultant/practitioners". Could I read the code and, with a weekend
>> of WTF and -hackers questions, figure out what is going on? Yes, but a lot of
>> people couldn't and I don't have the time.
>
> I think tests without reading the code are quite sensible and
> important. And it perfectly makes sense to ask for information about
> what to test. But fundamentally testing is a lot of work, as is writing
> and reviewing code; unless you're really really good at destructive
> testing, you won't find much in a 15 minute break.

Yes, this is true but with a proper testing framework, I don't need a 15 minute break. I need 1 hour to configure, the rest just "happens" and reports back. I have cycles to test, I have team members to help test (as do *lots* of other people) but sometimes we just get lost in how to help.

>> You want me (or people like me) to test more? Give us an easy way to
>> do it.
>
> Useful additional testing and easy just don't go well together. By the
> time I have made it easy I've done the testing that's needed.

I don't know that I can agree with this. A proper harness allows you to execute: go.sh and boom... 2, 4, even 8 hours later you get a report. I will not argue that it isn't easy to implement but I know it can be done.

>> Otherwise, we do what we can, which is try and interface on the things that
>> will directly and immediately affect us (like keywords and syntax).
>
> The amount of bikeshedding on -hackers steals energy and time for
> actually working on stuff, including testing. So I have little sympathy
> for the amount of bike shedding done.

Ensuring a reasonable and thought-out interface for users is not bike shedding; it is at least as important and possibly more important than any feature we add.

Sincerely,

JD

--
Command Prompt, Inc. http://the.postgres.company/ +1-503-667-4564
PostgreSQL Centered full stack support, consulting and development.
Everyone appreciates your honesty, until you are honest with them.
On 2016-05-06 14:39:57 -0700, Joshua D. Drake wrote:
> > > What are we looking for exactly?
> >
> > Data corruption, efficiency problems.
>
> I am really not trying to be difficult here but Data Corruption is an easy
> one... what is the metric we accept as an efficiency problem?

That's indeed not easy to define. In this case I'd say vacuums taking longer, index only scans being slower, more WAL being generated would count?

> > I think tests without reading the code are quite sensible and
> > important. And it perfectly makes sense to ask for information about
> > what to test. But fundamentally testing is a lot of work, as is writing
> > and reviewing code; unless you're really really good at destructive
> > testing, you won't find much in a 15 minute break.
>
> Yes, this is true but with a proper testing framework, I don't need a 15
> minute break. I need 1 hour to configure, the rest just "happens" and
> reports back.

That only works if somebody writes such tests. And in that case the tester having run them will often suffice (until related changes are being made). I'm not arguing against introducing more tests into the codebase - I'm rather fervently for that. But that really isn't what's going to avoid issues like this feature (or multixact) causing problems, because those tests will just test what the author thought of.

> > > You want me (or people like me) to test more? Give us an easy way to
> > > do it.
> >
> > Useful additional testing and easy just don't go well together. By the
> > time I have made it easy I've done the testing that's needed.
>
> I don't know that I can agree with this. A proper harness allows you to
> execute: go.sh and boom... 2, 4, even 8 hours later you get a report. I will
> not argue that it isn't easy to implement but I know it can be done.

The problem is that the contents of go.sh are the much more relevant part than the 8 hours.

Greetings,

Andres Freund
On 2016-05-06 18:36:52 -0300, Alvaro Herrera wrote:
> Andres Freund wrote:
> > On 2016-05-06 14:17:13 -0700, Joshua D. Drake wrote:
> > > How do I test?
> > >
> > > Is there a script I can run?
> >
> > Unfortunately there are few interesting things to test with pre-made
> > scripts. There's no relevant OS dependency here, so each already
> > existing test doesn't really lead to significantly increased coverage
> > being run by other people. Generally, when testing for correctness
> > issues, it's often of limited benefit to run tests written by the author
> > or reviewer - such scripts will usually just test things either one has
> > thought of. The dangerous areas are the ones neither author nor reviewer
> > has considered.
>
> We touched this question in connection with multixact freezing and
> wraparound. Testers seem to want to be given a script that they can
> install and run, then go for a beer and get back to a bunch of errors to
> report. But it doesn't work that way; writing a useful test script
> requires a lot of effort.

Right. And once written, often enough running it on a lot more instances only marginally increases the coverage.

> Jeff Janes has done astounding work in these matters. (I don't think
> we credit him enough for that.)

+many.
On 05/06/2016 02:48 PM, Andres Freund wrote:
> On 2016-05-06 14:39:57 -0700, Joshua D. Drake wrote:
>> Yes, this is true but with a proper testing framework, I don't need a 15
>> minute break. I need 1 hour to configure, the rest just "happens" and
>> reports back.
>
> That only works if somebody writes such tests.

Agreed.

> And in that case the
> tester having run them will often suffice (until related changes are being
> made). I'm not arguing against introducing more tests into the codebase
> - I'm rather fervently for that. But that really isn't what's going to
> avoid issues like this feature (or multixact) causing problems, because
> those tests will just test what the author thought of.

Good point. I am not sure how to address the alternative though.

>>>> You want me (or people like me) to test more? Give us an easy way to
>>>> do it.
>>>
>>> Useful additional testing and easy just don't go well together. By the
>>> time I have made it easy I've done the testing that's needed.
>>
>> I don't know that I can agree with this. A proper harness allows you to
>> execute: go.sh and boom... 2, 4, even 8 hours later you get a report. I will
>> not argue that it isn't easy to implement but I know it can be done.
>
> The problem is that the contents of go.sh are the much more relevant
> part than the 8 hours.

True. Please don't misunderstand, I am not saying this is "easy". I just hope that it is something we work toward.

Sincerely,

JD

--
Command Prompt, Inc. http://the.postgres.company/ +1-503-667-4564
PostgreSQL Centered full stack support, consulting and development.
Everyone appreciates your honesty, until you are honest with them.
On 2016-05-06 18:31:03 -0300, Alvaro Herrera wrote:
> I don't know what happens when the freeze_table_age threshold is
> reached.

We scan all non-frozen pages, whereas we earlier had to scan all pages. That's really both the significant benefit, and the danger. Because if we screw up the all-frozen bits in the visibility map, we'll be screwed soon after.

> Do we scan the whole table when that happens?

No, there's atm no way to force a whole-table vacuum, besides manually rm'ing the _vm fork.

> Another question on this feature is what happens with the table age
> (relfrozenxid, relminmxid) when the table is not wholly scanned by
> vacuum.

Basically we increase the horizons whenever scanning all pages that are not known to be frozen (+ potentially some frozen ones due to the skipping logic). Without that there'd really not be a point in the freeze map feature, as we'd continue to have the expensive anti-wraparound vacuums.

Andres
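[A compact paraphrase - ours, not the committed code - of that horizon-advancement decision: blocks skipped because the VM says they are all-frozen still count toward full coverage, so relfrozenxid/relminmxid can be advanced without re-reading them, while blocks skipped for any other reason (such as a held pin) block the advance:

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

typedef uint32_t BlockNumber;

/* Frozen-skipped pages count as covered; pages skipped for other
 * reasons do not, which is why miscounting frozenskipped_pages would
 * wrongly prevent (or, worse, wrongly allow) advancing the horizons. */
static bool
can_advance_relfrozenxid(BlockNumber scanned_pages,
                         BlockNumber frozenskipped_pages,
                         BlockNumber rel_pages)
{
    return (scanned_pages + frozenskipped_pages) >= rel_pages;
}

int
main(void)
{
    /* 30 pages read + 70 all-frozen skips cover a 100-page table */
    printf("%d\n", can_advance_relfrozenxid(30, 70, 100));
    return 0;
}
]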
On Fri, May 6, 2016 at 2:49 PM, Andres Freund <andres@anarazel.de> wrote: >> Jeff Janes has done astounding work in these matters. (I don't think >> we credit him enough for that.) > > +many. Agreed. I'm a huge fan of what Jeff has been able to do in this area. I often say so. It would be even better if Jeff's approach to testing was followed as an example by other people, but I wouldn't bet on it ever happening. It requires real persistence and deep understanding to do well. -- Peter Geoghegan
On Sat, May 7, 2016 at 8:34 AM, Robert Haas <robertmhaas@gmail.com> wrote: > On Mon, May 2, 2016 at 8:25 PM, Andres Freund <andres@anarazel.de> wrote: >> +static const uint8 number_of_ones_for_visible[256] = { >> ... >> +}; >> +static const uint8 number_of_ones_for_frozen[256] = { >> ... >> }; >> >> Did somebody verify the new contents are correct? > > I admit that I didn't. It seemed like an unlikely place for a goof, > but I guess we should verify. Looks correct. The tables match the output of the attached script. -- Thomas Munro http://www.enterprisedb.com
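[Since the attached script itself is not reproduced in the archive, here is a quick stand-alone cross-check of our own in C: for every possible VM byte, count how many of its four two-bit slots have the all-visible (low bit of the slot) or all-frozen (high bit) flag set, and compare the printed tables against the committed arrays:

#include <stdio.h>

int
main(void)
{
    for (int frozen = 0; frozen <= 1; frozen++)
    {
        printf("number_of_ones_for_%s:\n", frozen ? "frozen" : "visible");
        for (int byte = 0; byte < 256; byte++)
        {
            int ones = 0;

            /* four heap blocks per VM byte, two bits per block */
            for (int slot = 0; slot < 4; slot++)
                if (byte & (1 << (slot * 2 + frozen)))
                    ones++;
            printf("%d%s", ones, (byte & 15) == 15 ? "\n" : ", ");
        }
    }
    return 0;
}
]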
On 2016-05-07 10:00:27 +1200, Thomas Munro wrote: > On Sat, May 7, 2016 at 8:34 AM, Robert Haas <robertmhaas@gmail.com> wrote: > >> Did somebody verify the new contents are correct? > > > > I admit that I didn't. It seemed like an unlikely place for a goof, > > but I guess we should verify. > > Looks correct. The tables match the output of the attached script. Great!
Alvaro Herrera wrote: > We touched this question in connection with multixact freezing and > wraparound. Testers seem to want to be given a script that they can > install and run, then go for a beer and get back to a bunch of errors to > report. Here I spent some time trying to explain what to test to try and find certain multixact bugs http://www.postgresql.org/message-id/20150605213832.GZ133018@postgresql.org -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Sat, May 7, 2016 at 6:00 AM, Joshua D. Drake <jd@commandprompt.com> wrote:
> On 05/06/2016 01:58 PM, Stephen Frost wrote:
>>
>> * Joshua D. Drake (jd@commandprompt.com) wrote:
>>>
>>> Yeah I thought about that, it is the word "FORCE" that bothers me.
>>> When you use FORCE there is an assumption that no matter what, it
>>> plows through (think rm -f). So if we don't use FROZEN, that's cool
>>> but FORCE doesn't work either.
>>
>> Isn't that exactly what this FORCE option being contemplated would do
>> though? Plow through the entire relation, regardless of what the VM
>> says is all frozen or not?
>>
>> Seems like FORCE is a good word for that to me.
>
> Except that we aren't FORCING a vacuum. That is the part I have contention
> with. To me, FORCE means:
>
> No matter what else is happening, we are vacuuming this relation (think
> locks).
>
> But I am also not going to dig in my heels. If that is truly what -hackers
> come up with, thank you at least for considering what I said.
>
> Sincerely,
>
> JD
>

As Joshua mentioned, the FORCE word might imply doing VACUUM while plowing through locks. I guess that it might confuse the users. IMO, since this option will be a way for emergencies, the SCANALL word works for me.

Or other ideas are:
VACUUM IGNOREVM
VACUUM RESCURE

Regards,

--
Masahiko Sawada
On Sat, May 7, 2016 at 11:08 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> On Sat, May 7, 2016 at 6:00 AM, Joshua D. Drake <jd@commandprompt.com> wrote:
>> On 05/06/2016 01:58 PM, Stephen Frost wrote:
>>>
>>> * Joshua D. Drake (jd@commandprompt.com) wrote:
>>>>
>>>> Yeah I thought about that, it is the word "FORCE" that bothers me.
>>>> When you use FORCE there is an assumption that no matter what, it
>>>> plows through (think rm -f). So if we don't use FROZEN, that's cool
>>>> but FORCE doesn't work either.
>>>
>>> Isn't that exactly what this FORCE option being contemplated would do
>>> though? Plow through the entire relation, regardless of what the VM
>>> says is all frozen or not?
>>>
>>> Seems like FORCE is a good word for that to me.
>>
>> Except that we aren't FORCING a vacuum. That is the part I have contention
>> with. To me, FORCE means:
>>
>> No matter what else is happening, we are vacuuming this relation (think
>> locks).
>>
>> But I am also not going to dig in my heels. If that is truly what -hackers
>> come up with, thank you at least for considering what I said.
>>
>> Sincerely,
>>
>> JD
>>
>
> As Joshua mentioned, the FORCE word might imply doing VACUUM while plowing
> through locks. I guess that it might confuse the users.
> IMO, since this option will be a way for emergencies, the SCANALL word works
> for me.
>
> Or other ideas are:
> VACUUM IGNOREVM
> VACUUM RESCURE
>

Oops, VACUUM RESCUE is correct.

Regards,

--
Masahiko Sawada
On Sun, May 8, 2016 at 3:18 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote: > On Sat, May 7, 2016 at 11:08 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote: >> On Sat, May 7, 2016 at 6:00 AM, Joshua D. Drake <jd@commandprompt.com> wrote: >>> On 05/06/2016 01:58 PM, Stephen Frost wrote: >>>> >>>> * Joshua D. Drake (jd@commandprompt.com) wrote: >>>>> >>>>> Yeah I thought about that, it is the word "FORCE" that bothers me. >>>>> When you use FORCE there is an assumption that no matter what, it >>>>> plows through (think rm -f). So if we don't use FROZEN, that's cool >>>>> but FORCE doesn't work either. >>>> >>>> >>>> Isn't that exactly what this FORCE option being contemplated would do >>>> though? Plow through the entire relation, regardless of what the VM >>>> says is all frozen or not? >>>> >>>> Seems like FORCE is a good word for that to me. >>> >>> >>> Except that we aren't FORCING a vacuum. That is the part I have contention >>> with. To me, FORCE means: >>> >>> No matter what else is happening, we are vacuuming this relation (think >>> locks). >>> >>> But I am also not going to dig in my heals. If that is truly what -hackers >>> come up with, thank you at least considering what I said. >>> >>> Sincerely, >>> >>> JD >>> >> >> As Joshua mentioned, FORCE word might imply doing VACUUM while plowing >> through locks. >> I guess that it might confuse the users. >> IMO, since this option will be a way for emergency, SCANALL word works for me. >> >> Or other ideas are, >> VACUUM IGNOREVM >> VACUUM RESCURE >> > > Oops, VACUUM RESCUE is correct. > Attached draft patch adds SCANALL option to VACUUM in order to scan all pages forcibly while ignoring visibility map information. The option name is SCANALL for now but we could change it after got consensus. Regards, -- Masahiko Sawada
On Tue, May 3, 2016 at 6:48 AM, Andres Freund <andres@anarazel.de> wrote: > fd31cd2 Don't vacuum all-frozen pages. - appendStringInfo(&buf, _("pages: %u removed, %u remain, %u skipped due to pins\n"), + appendStringInfo(&buf, _("pages: %u removed, %u remain, %u skipped due to pins, %u skipped frozen\n"), vacrelstats->pages_removed, vacrelstats->rel_pages, - vacrelstats->pinskipped_pages); + vacrelstats->pinskipped_pages, + vacrelstats->frozenskipped_pages); The verbose information about skipped frozen pages is emitted only by autovacuum, but I think that this information is also helpful for manual vacuum. Please find attached a patch which fixes that. Regards, -- Masahiko Sawada
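With that fix, a manual VACUUM VERBOSE reports the same counter that autovacuum already logs; the pages line would then read something like the following (the numbers are invented for illustration):

    pages: 0 removed, 8850 remain, 2 skipped due to pins, 8640 skipped frozen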
On Sun, May 8, 2016 at 10:42 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote: > Attached draft patch adds SCANALL option to VACUUM in order to scan > all pages forcibly while ignoring visibility map information. > The option name is SCANALL for now but we could change it after got consensus. If we're going to go that way, I'd say it should be scan_all rather than scanall. Makes it clearer, at least IMHO. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Mon, May 9, 2016 at 10:53 PM, Robert Haas <robertmhaas@gmail.com> wrote: > On Sun, May 8, 2016 at 10:42 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote: >> Attached draft patch adds SCANALL option to VACUUM in order to scan >> all pages forcibly while ignoring visibility map information. >> The option name is SCANALL for now but we could change it after got consensus. > > If we're going to go that way, I'd say it should be scan_all rather > than scanall. Makes it clearer, at least IMHO. Just to add some diversity to opinions, maybe there should be a separate command for performing integrity checks. Currently the best ways to actually verify database correctness do so as a side effect. The question that I get pretty much every time after I explain why we have data checksums, is "how do I check that they are correct" and we don't have a nice answer for that now. We could also use some ways to sniff out corrupted rows that don't involve crashing the server in a loop. Vacuuming pages that supposedly don't need vacuuming just to verify integrity seems very much in the same vein. I know right now isn't exactly the best time to hastily slap on such a feature, but I just wanted the thought to be out there for consideration. Regards, Ants Aasma
On Mon, May 9, 2016 at 7:40 PM, Ants Aasma <ants.aasma@eesti.ee> wrote: > On Mon, May 9, 2016 at 10:53 PM, Robert Haas <robertmhaas@gmail.com> wrote: >> On Sun, May 8, 2016 at 10:42 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote: >>> Attached draft patch adds SCANALL option to VACUUM in order to scan >>> all pages forcibly while ignoring visibility map information. >>> The option name is SCANALL for now but we could change it after got consensus. >> >> If we're going to go that way, I'd say it should be scan_all rather >> than scanall. Makes it clearer, at least IMHO. > > Just to add some diversity to opinions, maybe there should be a > separate command for performing integrity checks. Currently the best > ways to actually verify database correctness do so as a side effect. > The question that I get pretty much every time after I explain why we > have data checksums, is "how do I check that they are correct" and we > don't have a nice answer for that now. We could also use some ways to > sniff out corrupted rows that don't involve crashing the server in a > loop. Vacuuming pages that supposedly don't need vacuuming just to > verify integrity seems very much in the same vein. > > I know right now isn't exactly the best time to hastily slap on such a > feature, but I just wanted the thought to be out there for > consideration. I think that it's quite reasonable to have ways of performing an integrity check that are separate from VACUUM, but this is about having a way to force VACUUM to scan all-frozen pages - and it's hard to imagine that we want a different command name for that. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Tue, May 10, 2016 at 11:30 PM, Robert Haas <robertmhaas@gmail.com> wrote: > On Mon, May 9, 2016 at 7:40 PM, Ants Aasma <ants.aasma@eesti.ee> wrote: >> On Mon, May 9, 2016 at 10:53 PM, Robert Haas <robertmhaas@gmail.com> wrote: >>> On Sun, May 8, 2016 at 10:42 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote: >>>> Attached draft patch adds SCANALL option to VACUUM in order to scan >>>> all pages forcibly while ignoring visibility map information. >>>> The option name is SCANALL for now but we could change it after got consensus. >>> >>> If we're going to go that way, I'd say it should be scan_all rather >>> than scanall. Makes it clearer, at least IMHO. >> >> Just to add some diversity to opinions, maybe there should be a >> separate command for performing integrity checks. Currently the best >> ways to actually verify database correctness do so as a side effect. >> The question that I get pretty much every time after I explain why we >> have data checksums, is "how do I check that they are correct" and we >> don't have a nice answer for that now. We could also use some ways to >> sniff out corrupted rows that don't involve crashing the server in a >> loop. Vacuuming pages that supposedly don't need vacuuming just to >> verify integrity seems very much in the same vein. >> >> I know right now isn't exactly the best time to hastily slap on such a >> feature, but I just wanted the thought to be out there for >> consideration. > > I think that it's quite reasonable to have ways of performing an > integrity check that are separate from VACUUM, but this is about > having a way to force VACUUM to scan all-frozen pages Or second way I came up with is having tool to remove particular _vm file safely, which is executed via SQL or client tool like pg_resetxlog. Attached updated VACUUM SCAN_ALL patch. Please find it. Regards, -- Masahiko Sawada
On 5/6/16 4:20 PM, Andres Freund wrote: > On 2016-05-06 14:15:47 -0700, Josh berkus wrote: >> For the serious testing, does anyone have a good technique for creating >> loads which would stress-test vacuum freezing? It's hard for me to come >> up with anything which wouldn't be very time-and-resource intensive >> (like running at 10,000 TPS for a week). > > I've changed the limits for freezing options a while back, so you can > now set autovacuum_freeze_max as low as 100000 (best set > vacuum_freeze_table_age accordingly). You'll have to come up with a > workload that doesn't overwrite all data continuously (otherwise > there'll never be old rows), but otherwise it should now be fairly easy > to test that kind of scenario. There's also been a tool for forcibly advancing XID floating around for quite some time. Using that could have the added benefit of verifying anti-wrap still works correctly. (Might be worth testing mxid wrap too...) -- Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX Experts in Analytics, Data Architecture and PostgreSQL Data in Trouble? Get it in Treble! http://BlueTreble.com 855-TREBLE2 (855-873-2532) mobile: 512-569-9461
On 5/6/16 4:55 PM, Peter Geoghegan wrote: > On Fri, May 6, 2016 at 2:49 PM, Andres Freund <andres@anarazel.de> wrote: >>> Jeff Janes has done astounding work in these matters. (I don't think >>> we credit him enough for that.) >> >> +many. > > Agreed. I'm a huge fan of what Jeff has been able to do in this area. > I often say so. It would be even better if Jeff's approach to testing > was followed as an example by other people, but I wouldn't bet on it > ever happening. It requires real persistence and deep understanding to > do well. It takes deep understanding to *design* the tests, not to write them. There's a lot of folks out there that will never understand enough to design tests meant to expose data corruption but who could easily code someone else's design, especially if we provided tools/ways to tweak a cluster to make testing easier/faster (such as artificially advancing XID/MXID). -- Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX Experts in Analytics, Data Architecture and PostgreSQL Data in Trouble? Get it in Treble! http://BlueTreble.com 855-TREBLE2 (855-873-2532) mobile: 512-569-9461
On 5/6/16 4:08 PM, Joshua D. Drake wrote: >>> >>> VACUUM THEWHOLEDAMNTHING >> >> +100 >> >> (hahahaha) > > You know what? Why not? Seriously? We aren't product. This is supposed > to be a bit fun. Let's have some fun with it? It would be so easy to > turn that into a positive advocacy opportunity. Honestly, for an option this obscure, I agree. I don't think we'd want any normally used stuff named so glibly, but I sure as heck could have used some easter-eggs like this when I was doing training. -- Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX Experts in Analytics, Data Architecture and PostgreSQL Data in Trouble? Get it in Treble! http://BlueTreble.com 855-TREBLE2 (855-873-2532) mobile: 512-569-9461
On 5/10/16 11:42 PM, Jim Nasby wrote: > On 5/6/16 4:55 PM, Peter Geoghegan wrote: >> On Fri, May 6, 2016 at 2:49 PM, Andres Freund <andres@anarazel.de> wrote: >>>> Jeff Janes has done astounding work in these matters. (I don't think >>>> we credit him enough for that.) >>> >>> +many. >> >> Agreed. I'm a huge fan of what Jeff has been able to do in this area. >> I often say so. It would be even better if Jeff's approach to testing >> was followed as an example by other people, but I wouldn't bet on it >> ever happening. It requires real persistence and deep understanding to >> do well. > > It takes deep understanding to *design* the tests, not to write them. > There's a lot of folks out there that will never understand enough to > design tests meant to expose data corruption but who could easily code > someone else's design, especially if we provided tools/ways to tweak a > cluster to make testing easier/faster (such as artificially advancing > XID/MXID). Speaking of which, another email in the thread made me realize that there's a test condition no one has mentioned: verifying we don't lose tuples after wraparound. To test this, you'd want a table that's mostly frozen. Ideally, dirty a single tuple on a bunch of frozen pages, with committed updates, deletes, and un-committed inserts. Advance XID far enough to get you close to wrap-around. Do a vacuum, SELECT count(*), advance XID past wraparound, SELECT count(*) again and you should get the same number. -- Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX Experts in Analytics, Data Architecture and PostgreSQL Data in Trouble? Get it in Treble! http://BlueTreble.com 855-TREBLE2 (855-873-2532) mobile: 512-569-9461
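For anyone scripting that recipe, a rough libpq driver could look like the sketch below. Note the assumptions: pg_advance_xid is a stand-in for whichever XID-advancing tool ends up being used (no such tool ships with the server), and table t is assumed to have been prepared as described above:

    #include <stdio.h>
    #include <stdlib.h>
    #include <libpq-fe.h>

    /* Run SELECT count(*) and return the result as a long. */
    static long
    count_rows(PGconn *conn)
    {
        PGresult   *res = PQexec(conn, "SELECT count(*) FROM t");
        long        n;

        if (PQresultStatus(res) != PGRES_TUPLES_OK)
        {
            fprintf(stderr, "query failed: %s", PQerrorMessage(conn));
            exit(1);
        }
        n = atol(PQgetvalue(res, 0, 0));
        PQclear(res);
        return n;
    }

    int
    main(void)
    {
        PGconn     *conn = PQconnectdb("");     /* settings from environment */
        long        before,
                    after;

        if (PQstatus(conn) != CONNECTION_OK)
        {
            fprintf(stderr, "connection failed: %s", PQerrorMessage(conn));
            return 1;
        }
        PQclear(PQexec(conn, "VACUUM t"));
        before = count_rows(conn);
        (void) system("pg_advance_xid --past-wraparound");  /* hypothetical */
        after = count_rows(conn);
        if (before == after)
            printf("ok: %ld rows survived wraparound\n", before);
        else
            printf("TUPLES LOST: %ld -> %ld\n", before, after);
        PQfinish(conn);
        return (before == after) ? 0 : 1;
    }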
On Tue, May 10, 2016 at 10:40 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote: > Or second way I came up with is having tool to remove particular _vm > file safely, which is executed via SQL or client tool like > pg_resetxlog. > > Attached updated VACUUM SCAN_ALL patch. > Please find it. We should support scan_all only with the new-style options syntax for VACUUM; that is, vacuum (scan_all) relname. That doesn't require making scan_all a keyword, which is good: this is a minor feature, and we don't want to bloat the parsing tables for it. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
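For context: with the parenthesized syntax the options end up as bits in the VacuumStmt option mask rather than as grammar keywords, which is why the parser footprint stays small. Roughly, following the existing flags in vacuum.h, with the last entry being this thread's hypothetical addition under its working name:

    typedef enum VacuumOption
    {
        VACOPT_VACUUM = 1 << 0,     /* do VACUUM */
        VACOPT_ANALYZE = 1 << 1,    /* do ANALYZE */
        VACOPT_VERBOSE = 1 << 2,    /* print progress info */
        VACOPT_FREEZE = 1 << 3,     /* FREEZE option */
        VACOPT_FULL = 1 << 4,       /* FULL (non-concurrent) vacuum */
        VACOPT_NOWAIT = 1 << 5,     /* don't wait to get lock */
        VACOPT_SKIPTOAST = 1 << 6,  /* don't process the TOAST table */
        VACOPT_SCANALL = 1 << 7     /* hypothetical: don't skip via the VM */
    } VacuumOption;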
On Mon, May 16, 2016 at 10:49 AM, Robert Haas <robertmhaas@gmail.com> wrote: > On Tue, May 10, 2016 at 10:40 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote: >> Or second way I came up with is having tool to remove particular _vm >> file safely, which is executed via SQL or client tool like >> pg_resetxlog. >> >> Attached updated VACUUM SCAN_ALL patch. >> Please find it. > > We should support scan_all only with the new-style options syntax for > VACUUM; that is, vacuum (scan_all) relname. That doesn't require > making scan_all a keyword, which is good: this is a minor feature, and > we don't want to bloat the parsing tables for it. > I agree with having new-style options syntax. Isn't it better to have SCAN_ALL option without parentheses? Syntaxes are; VACUUM SCAN_ALL table_name; VACUUM SCAN_ALL; -- for all tables on database Regards, -- Masahiko Sawada
Masahiko Sawada wrote: > On Mon, May 16, 2016 at 10:49 AM, Robert Haas <robertmhaas@gmail.com> wrote: > > We should support scan_all only with the new-style options syntax for > > VACUUM; that is, vacuum (scan_all) relname. That doesn't require > > making scan_all a keyword, which is good: this is a minor feature, and > > we don't want to bloat the parsing tables for it. > > I agree with having new-style options syntax. > Isn't it better to have SCAN_ALL option without parentheses? > > Syntaxes are; > VACUUM SCAN_ALL table_name; > VACUUM SCAN_ALL; -- for all tables on database No, I agree with Robert that we shouldn't add any more such options to avoid keyword proliferation. Syntaxes are; VACUUM (SCAN_ALL) table_name; VACUUM (SCAN_ALL); -- for all tables on database Is SCAN_ALL really the best we can do here? The business of having an underscore in an option name has no precedent (other than CURRENT_DATABASE and the like). How about COMPLETE, TOTAL, or WHOLE? -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Tue, May 17, 2016 at 3:32 PM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote: > Masahiko Sawada wrote: >> On Mon, May 16, 2016 at 10:49 AM, Robert Haas <robertmhaas@gmail.com> wrote: > >> > We should support scan_all only with the new-style options syntax for >> > VACUUM; that is, vacuum (scan_all) relname. That doesn't require >> > making scan_all a keyword, which is good: this is a minor feature, and >> > we don't want to bloat the parsing tables for it. >> >> I agree with having new-style options syntax. >> Isn't it better to have SCAN_ALL option without parentheses? >> >> Syntaxes are; >> VACUUM SCAN_ALL table_name; >> VACUUM SCAN_ALL; -- for all tables on database > > No, I agree with Robert that we shouldn't add any more such options to > avoid keyword proliferation. > > Syntaxes are; > VACUUM (SCAN_ALL) table_name; > VACUUM (SCAN_ALL); -- for all tables on database Okay, I agree with this. > Is SCAN_ALL really the best we can do here? The business of having an > underscore in an option name has no precedent (other than > CURRENT_DATABASE and the like). Another way is to have a tool or function that removes a particular _vm file safely, for example. > How about COMPLETE, TOTAL, or WHOLE? IMHO, I don't have a strong opinion about SCAN_ALL, as long as we document that option and the name doesn't confuse users. But ISTM that COMPLETE or TOTAL might mislead users into thinking that a normal vacuum is incomplete. Regards, -- Masahiko Sawada
On 05/17/2016 12:32 PM, Alvaro Herrera wrote: > Syntaxes are; > VACUUM (SCAN_ALL) table_name; > VACUUM (SCAN_ALL); -- for all tables on database > > Is SCAN_ALL really the best we can do here? The business of having an > underscore in an option name has no precedent (other than > CURRENT_DATABASE and the like). How about COMPLETE, TOTAL, or WHOLE? > VACUUM (ANALYZE, VERBOSE, WHOLE) .... That seems reasonable? I agree that SCAN_ALL doesn't fit. I am not trying to pull a left turn but is there a technical reason we don't just make FULL do this? JD -- Command Prompt, Inc. http://the.postgres.company/ +1-503-667-4564 PostgreSQL Centered full stack support, consulting and development. Everyone appreciates your honesty, until you are honest with them.
On Tue, May 17, 2016 at 4:34 PM, Joshua D. Drake <jd@commandprompt.com> wrote: > On 05/17/2016 12:32 PM, Alvaro Herrera wrote: > >> Syntaxes are; >> VACUUM (SCAN_ALL) table_name; >> VACUUM (SCAN_ALL); -- for all tables on database >> >> Is SCAN_ALL really the best we can do here? The business of having an >> underscore in an option name has no precedent (other than >> CURRENT_DATABASE and the like). How about COMPLETE, TOTAL, or WHOLE? >> > > VACUUM (ANALYZE, VERBOSE, WHOLE) > .... > > That seems reasonable? I agree that SCAN_ALL doesn't fit. I am not trying to > pull a left turn but is there a technical reason we don't just make FULL do > this? > FULL option requires AccessExclusiveLock, which could be a problem. Regards, -- Masahiko Sawada
On 17/05/16 21:32, Alvaro Herrera wrote: > Is SCAN_ALL really the best we can do here? The business of having an > underscore in an option name has no precedent (other than > CURRENT_DATABASE and the like). ALTER DATABASE has options for ALLOW_CONNECTIONS, CONNECTION_LIMIT, and IS_TEMPLATE. > How about COMPLETE, TOTAL, or WHOLE? Sure, I'll play this game. I like EXHAUSTIVE. -- Vik Fearing +33 6 46 75 15 36 http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support
On 18/05/16 09:34, Vik Fearing wrote: > On 17/05/16 21:32, Alvaro Herrera wrote: >> Is SCAN_ALL really the best we can do here? The business of having an >> underscore in an option name has no precedent (other than >> CURRENT_DATABASE and the like). > ALTER DATABASE has options for ALLOW_CONNECTIONS, CONNECTION_LIMIT, and > IS_TEMPLATE. > >> How about COMPLETE, TOTAL, or WHOLE? > Sure, I'll play this game. I like EXHAUSTIVE. I prefer 'WHOLE', as it seems more obvious (and not because of the pun relating to 'wholesomeness'!!!)
On Tue, May 17, 2016 at 5:47 PM, Gavin Flower <GavinFlower@archidevsys.co.nz> wrote: > On 18/05/16 09:34, Vik Fearing wrote: >> On 17/05/16 21:32, Alvaro Herrera wrote: >>> >>> Is SCAN_ALL really the best we can do here? The business of having an >>> underscore in an option name has no precedent (other than >>> CURRENT_DATABASE and the like). >> >> ALTER DATABASE has options for ALLOW_CONNECTIONS, CONNECTION_LIMIT, and >> IS_TEMPLATE. >> >>> How about COMPLETE, TOTAL, or WHOLE? >> >> Sure, I'll play this game. I like EXHAUSTIVE. > > I prefer 'WHOLE', as it seems more obvious (and not because of the pun > relating to 'wholesomeness'!!!) I think that users might believe that they need VACUUM (WHOLE) a lot more often than they will actually need this option. "Of course I want to vacuum my whole table!" I think we should give this a name that hints more strongly at this being an exceptional thing, like vacuum (even_frozen_pages). -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 5/18/16 6:37 AM, Robert Haas wrote: > On Tue, May 17, 2016 at 5:47 PM, Gavin Flower > <GavinFlower@archidevsys.co.nz> wrote: >> On 18/05/16 09:34, Vik Fearing wrote: >>> On 17/05/16 21:32, Alvaro Herrera wrote: >>>> >>>> Is SCAN_ALL really the best we can do here? The business of having an >>>> underscore in an option name has no precedent (other than >>>> CURRENT_DATABASE and the like). >>> >>> ALTER DATABASE has options for ALLOW_CONNECTIONS, CONNECTION_LIMIT, and >>> IS_TEMPLATE. >>> >>>> How about COMPLETE, TOTAL, or WHOLE? >>> >>> Sure, I'll play this game. I like EXHAUSTIVE. >> >> I prefer 'WHOLE', as it seems more obvious (and not because of the pun >> relating to 'wholesomeness'!!!) > > I think that users might believe that they need VACUUM (WHOLE) a lot > more often than they will actually need this option. "Of course I > want to vacuum my whole table!" > > I think we should give this a name that hints more strongly at this > being an exceptional thing, like vacuum (even_frozen_pages). How about just FROZEN? Perhaps it's too confusing to have that and FREEZE, but I thought I would throw it out there. -- -David david@pgmasters.net
On Wed, May 18, 2016 at 8:41 AM, David Steele <david@pgmasters.net> wrote: >> I think we should give this a name that hints more strongly at this >> being an exceptional thing, like vacuum (even_frozen_pages). > > How about just FROZEN? Perhaps it's too confusing to have that and FREEZE, > but I thought I would throw it out there. It's not a bad thought, but I do think it might be a bit confusing. My main priority for this new option is that people aren't tempted to use it very often, and I think a name like "even_frozen_pages" is more likely to accomplish that than just "frozen". -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 05/18/2016 05:51 AM, Robert Haas wrote: > On Wed, May 18, 2016 at 8:41 AM, David Steele <david@pgmasters.net> wrote: >>> I think we should give this a name that hints more strongly at this >>> being an exceptional thing, like vacuum (even_frozen_pages). >> >> How about just FROZEN? Perhaps it's too confusing to have that and FREEZE, >> but I thought I would throw it out there. > > It's not a bad thought, but I do think it might be a bit confusing. > My main priority for this new option is that people aren't tempted to > use it very often, and I think a name like "even_frozen_pages" is more > likely to accomplish that than just "frozen". > freeze_all_pages? JD -- Command Prompt, Inc. http://the.postgres.company/ +1-503-667-4564 PostgreSQL Centered full stack support, consulting and development. Everyone appreciates your honesty, until you are honest with them.
On Wed, May 18, 2016 at 9:42 AM, Joshua D. Drake <jd@commandprompt.com> wrote: >> It's not a bad thought, but I do think it might be a bit confusing. >> My main priority for this new option is that people aren't tempted to >> use it very often, and I think a name like "even_frozen_pages" is more >> likely to accomplish that than just "frozen". > > freeze_all_pages? No, that's what the existing FREEZE option does. This new option is about unnecessarily vacuuming pages that don't need it. The expectation is that vacuuming all-frozen pages will be a no-op. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
2016-05-18 16:45 GMT+03:00 Robert Haas <robertmhaas@gmail.com>: > No, that's what the existing FREEZE option does. This new option is > about unnecessarily vacuuming pages that don't need it. The > expectation is that vacuuming all-frozen pages will be a no-op. VACUUM (INCLUDING ALL) ? -- Victor Y. Yegorov
On 05/18/2016 09:55 AM, Victor Yegorov wrote: > 2016-05-18 16:45 GMT+03:00 Robert Haas <robertmhaas@gmail.com>: > > No, that's what the existing FREEZE option does. This new option is > about unnecessarily vacuuming pages that don't need it. The > expectation is that vacuuming all-frozen pages will be a no-op. > > > VACUUM (INCLUDING ALL) ? VACUUM (FORCE ALL) ? Joe -- Crunchy Data - http://crunchydata.com PostgreSQL Support for Secure Enterprises Consulting, Training, & Open Source Development
On Wed, May 18, 2016 at 7:09 AM, Joe Conway <mail@joeconway.com> wrote: > On 05/18/2016 09:55 AM, Victor Yegorov wrote: >> 2016-05-18 16:45 GMT+03:00 Robert Haas <robertmhaas@gmail.com>: >> >> No, that's what the existing FREEZE option does. This new option is >> about unnecessarily vacuuming pages that don't need it. The >> expectation is that vacuuming all-frozen pages will be a no-op. >> >> >> VACUUM (INCLUDING ALL) ? > > VACUUM (FORCE ALL) ? How about going with something that says more about why we are doing it, rather than trying to describe in one or two words what it is doing? VACUUM (FORENSIC) VACUUM (DEBUG) VACUUM (LINT) Cheers, Jeff
On Wed, May 18, 2016 at 8:52 AM, Jeff Janes <jeff.janes@gmail.com> wrote: > How about going with something that says more about why we are doing > it, rather than trying to describe in one or two words what it is > doing? > > VACUUM (FORENSIC) > > VACUUM (DEBUG) > > VACUUM (LINT) +1 -- Peter Geoghegan
On 05/18/2016 03:51 PM, Peter Geoghegan wrote: > On Wed, May 18, 2016 at 8:52 AM, Jeff Janes <jeff.janes@gmail.com> wrote: >> How about going with something that says more about why we are doing >> it, rather than trying to describe in one or two words what it is >> doing? >> >> VACUUM (FORENSIC) >> >> VACUUM (DEBUG) >> >> VACUUM (LINT) > > +1 Maybe this is the wrong perspective. I mean, is there a reason we even need this option, other than a lack of any other way to do a full table scan to check for corruption, etc.? If we're only doing this for integrity checking, then maybe it's better if it becomes a function, which could be later extended with additional forensic features? -- -- Josh Berkus Red Hat OSAS (any opinions are my own)
Josh berkus <josh@agliodbs.com> writes: > Maybe this is the wrong perspective. I mean, is there a reason we even > need this option, other than a lack of any other way to do a full table > scan to check for corruption, etc.? If we're only doing this for > integrity checking, then maybe it's better if it becomes a function, > which could be later extended with additional forensic features? Yes, I've been wondering that too. VACUUM is not meant as a corruption checker, and should not be made into one, so what is the point of this flag exactly? (AFAIK, "select count(*) from table" would offer a similar amount of sanity checking as a full-table VACUUM scan does, so it's not like we've removed functionality with no near-term replacement.) regards, tom lane
On 2016-05-18 18:25:39 -0400, Tom Lane wrote: > Josh berkus <josh@agliodbs.com> writes: > > Maybe this is the wrong perspective. I mean, is there a reason we even > > need this option, other than a lack of any other way to do a full table > > scan to check for corruption, etc.? If we're only doing this for > > integrity checking, then maybe it's better if it becomes a function, > > which could be later extended with additional forensic features? > > Yes, I've been wondering that too. VACUUM is not meant as a corruption > checker, and should not be made into one, so what is the point of this > flag exactly? Well, so far a VACUUM FREEZE (or just setting vacuum_freeze_table_age = 0) verified the correctness of the visibility map; and that found a number of bugs. Now visibilitymap grew additional responsibilities, with a noticeable risk of data eating bugs, and there's no way to verify whether visibilitymap's frozen bits are set correctly. > (AFAIK, "select count(*) from table" would offer a similar amount of > sanity checking as a full-table VACUUM scan does, so it's not like > we've removed functionality with no near-term replacement.) I don't think that'd do anything comparable to /* * As of PostgreSQL 9.2, the visibility map bit should never be set if * the page-level bit is clear. However, it's possible that the bit * got cleared after we checked it and before we took the buffer * content lock, so we must recheck before jumping to the conclusion * that something bad has happened. */ else if (all_visible_according_to_vm && !PageIsAllVisible(page) && VM_ALL_VISIBLE(onerel, blkno, &vmbuffer)) { elog(WARNING, "page is not marked all-visible but visibility map bit is set in relation \"%s\" page %u", relname, blkno); visibilitymap_clear(onerel, blkno, vmbuffer); } If we had a checking module for all this it'd possibly be sufficient, but we don't. Greetings, Andres Freund
Andres Freund <andres@anarazel.de> writes: > On 2016-05-18 18:25:39 -0400, Tom Lane wrote: >> Yes, I've been wondering that too. VACUUM is not meant as a corruption >> checker, and should not be made into one, so what is the point of this >> flag exactly? > Well, so far a VACUUM FREEZE (or just setting vacuum_freeze_table_age = > 0) verified the correctness of the visibility map; and that found a > number of bugs. Now visibilitymap grew additional responsibilities, > with a noticeable risk of data eating bugs, and there's no way to verify > whether visibilitymap's frozen bits are set correctly. Meh. I'm not sure we should grow a rather half-baked feature we'll never be able to remove as a substitute for a separate sanity checker. The latter is really the right place for this kind of thing. regards, tom lane
On 2016-05-18 18:42:16 -0400, Tom Lane wrote: > Andres Freund <andres@anarazel.de> writes: > > On 2016-05-18 18:25:39 -0400, Tom Lane wrote: > >> Yes, I've been wondering that too. VACUUM is not meant as a corruption > >> checker, and should not be made into one, so what is the point of this > >> flag exactly? > > > Well, so far a VACUUM FREEZE (or just setting vacuum_freeze_table_age = > > 0) verified the correctness of the visibility map; and that found a > > number of bugs. Now visibilitymap grew additional responsibilities, > > with a noticeable risk of data eating bugs, and there's no way to verify > > whether visibilitymap's frozen bits are set correctly. > > Meh. I'm not sure we should grow a rather half-baked feature we'll never > be able to remove as a substitute for a separate sanity checker. The > latter is really the right place for this kind of thing. It's not a new feature, it's a feature we removed as a side effect. And one that allows us to evaluate whether the new feature actually works.
Andres Freund wrote: > > (AFAIK, "select count(*) from table" would offer a similar amount of > > sanity checking as a full-table VACUUM scan does, so it's not like > > we've removed functionality with no near-term replacement.) > > I don't think that'd do anything comparable to > /* > * As of PostgreSQL 9.2, the visibility map bit should never be set if > * the page-level bit is clear. However, it's possible that the bit > * got cleared after we checked it and before we took the buffer > * content lock, so we must recheck before jumping to the conclusion > * that something bad has happened. > */ > else if (all_visible_according_to_vm && !PageIsAllVisible(page) > && VM_ALL_VISIBLE(onerel, blkno, &vmbuffer)) > { > elog(WARNING, "page is not marked all-visible but visibility map bit is set in relation \"%s\" page %u", > relname, blkno); > visibilitymap_clear(onerel, blkno, vmbuffer); > } > > If we had a checking module for all this it'd possibly be sufficient, > but we don't. Here's an idea. We need core-blessed extensions (src/extensions/, you know I've proposed this before), so why not take this opportunity to create our first such and make it carry a function to scan a table completely to do this task. Since we were considering a new VACUUM option, surely this is serious enough to warrant more than just contrib. -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Wed, May 18, 2016 at 3:57 PM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote: > Since we were considering a new VACUUM option, surely this is serious > enough to warrant more than just contrib. I would like to see us consider the long-term best place for amcheck's functionality at the same time. Ideally, verification would be a somewhat generic operation, with AM-specific code invoked as appropriate. -- Peter Geoghegan
On Fri, May 06, 2016 at 04:42:48PM -0400, Robert Haas wrote: > On Thu, May 5, 2016 at 2:20 PM, Andres Freund <andres@anarazel.de> wrote: > > On 2016-05-02 14:48:18 -0700, Andres Freund wrote: > > + char new_vmbuf[BLCKSZ]; > > + char *new_cur = new_vmbuf; > > + bool empty = true; > > + bool old_lastpart; > > + > > + /* Copy page header in advance */ > > + memcpy(new_vmbuf, &pageheader, SizeOfPageHeaderData); > > > > Shouldn't we zero out new_vmbuf? Afaics we're not necessarily zeroing it > > with old_lastpart && !empty, right? > > Oh, dear. That seems like a possible data corruption bug. Maybe we'd > better fix that right away (although I don't actually have time before > the wrap). [This is a generic notification.] The above-described topic is currently a PostgreSQL 9.6 open item. Robert, since you committed the patch believed to have created it, you own this open item. If some other commit is more relevant or if this does not belong as a 9.6 open item, please let us know. Otherwise, please observe the policy on open item ownership[1] and send a status update within 72 hours of this message. Include a date for your subsequent status update. Testers may discover new open items at any time, and I want to plan to get them all fixed well in advance of shipping 9.6rc1. Consequently, I will appreciate your efforts toward speedy resolution. Thanks. [1] http://www.postgresql.org/message-id/20160527025039.GA447393@tornado.leadboat.com
On Sun, May 29, 2016 at 2:44 PM, Noah Misch <noah@leadboat.com> wrote: > On Fri, May 06, 2016 at 04:42:48PM -0400, Robert Haas wrote: >> On Thu, May 5, 2016 at 2:20 PM, Andres Freund <andres@anarazel.de> wrote: >> > On 2016-05-02 14:48:18 -0700, Andres Freund wrote: >> > + char new_vmbuf[BLCKSZ]; >> > + char *new_cur = new_vmbuf; >> > + bool empty = true; >> > + bool old_lastpart; >> > + >> > + /* Copy page header in advance */ >> > + memcpy(new_vmbuf, &pageheader, SizeOfPageHeaderData); >> > >> > Shouldn't we zero out new_vmbuf? Afaics we're not necessarily zeroing it >> > with old_lastpart && !empty, right? >> >> Oh, dear. That seems like a possible data corruption bug. Maybe we'd >> better fix that right away (although I don't actually have time before >> the wrap). > > [This is a generic notification.] > > The above-described topic is currently a PostgreSQL 9.6 open item. Robert, > since you committed the patch believed to have created it, you own this open > item. If some other commit is more relevant or if this does not belong as a > 9.6 open item, please let us know. Otherwise, please observe the policy on > open item ownership[1] and send a status update within 72 hours of this > message. Include a date for your subsequent status update. Testers may > discover new open items at any time, and I want to plan to get them all fixed > well in advance of shipping 9.6rc1. Consequently, I will appreciate your > efforts toward speedy resolution. Thanks. > > [1] http://www.postgresql.org/message-id/20160527025039.GA447393@tornado.leadboat.com Thank you for the notification. The check tool for the visibility map is still under discussion. I'm going to address the other review comments, and send the patch ASAP. Regards, -- Masahiko Sawada
On Wed, May 18, 2016 at 3:57 PM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote: > Andres Freund wrote: > >> >> If we had a checking module for all this it'd possibly be sufficient, >> but we don't. > > Here's an idea. We need core-blessed extensions (src/extensions/, you > know I've proposed this before), so why not take this opportunity to > create our first such and make it carry a function to scan a table > completely to do this task. > > Since we were considering a new VACUUM option, surely this is serious > enough to warrant more than just contrib. What does "core-blessed" mean? The commit rights for contrib/ are the same as they are for src/ Cheers, Jeff
On Tue, May 31, 2016 at 4:40 AM, Jeff Janes <jeff.janes@gmail.com> wrote: > On Wed, May 18, 2016 at 3:57 PM, Alvaro Herrera > <alvherre@2ndquadrant.com> wrote: >> Andres Freund wrote: >> >>> >>> If we had a checking module for all this it'd possibly be sufficient, >>> but we don't. >> >> Here's an idea. We need core-blessed extensions (src/extensions/, you >> know I've proposed this before), so why not take this opportunity to >> create our first such and make it carry a function to scan a table >> completely to do this task. >> >> Since we were considering a new VACUUM option, surely this is serious >> enough to warrant more than just contrib. > > What does "core-blessed" mean? The commit rights for contrib/ are the > same as they are for src/ Personally I understand contrib/ modules as third-party plugins that are considered not mature enough to be part of src/backend or src/bin, but that could one day become so. See pg_upgrade's recent move, for example. src/extensions/ would include third-party plugins that are thought to be useful and are part of the main server package, but are not something that we want to enable by default. -- Michael
On Sun, May 29, 2016 at 1:44 AM, Noah Misch <noah@leadboat.com> wrote: > On Fri, May 06, 2016 at 04:42:48PM -0400, Robert Haas wrote: >> On Thu, May 5, 2016 at 2:20 PM, Andres Freund <andres@anarazel.de> wrote: >> > On 2016-05-02 14:48:18 -0700, Andres Freund wrote: >> > + char new_vmbuf[BLCKSZ]; >> > + char *new_cur = new_vmbuf; >> > + bool empty = true; >> > + bool old_lastpart; >> > + >> > + /* Copy page header in advance */ >> > + memcpy(new_vmbuf, &pageheader, SizeOfPageHeaderData); >> > >> > Shouldn't we zero out new_vmbuf? Afaics we're not necessarily zeroing it >> > with old_lastpart && !empty, right? >> >> Oh, dear. That seems like a possible data corruption bug. Maybe we'd >> better fix that right away (although I don't actually have time before >> the wrap). > > [This is a generic notification.] > > The above-described topic is currently a PostgreSQL 9.6 open item. Robert, > since you committed the patch believed to have created it, you own this open > item. If some other commit is more relevant or if this does not belong as a > 9.6 open item, please let us know. Otherwise, please observe the policy on > open item ownership[1] and send a status update within 72 hours of this > message. Include a date for your subsequent status update. Testers may > discover new open items at any time, and I want to plan to get them all fixed > well in advance of shipping 9.6rc1. Consequently, I will appreciate your > efforts toward speedy resolution. Thanks. I am going to try to find time to look at this later this week, but realistically it's going to be a little bit difficult to find that time. I was away over Memorial Day weekend and was in meetings most of today. I have a huge pile of email to catch up on. I will send another status update no later than Friday. If Andres or anyone else wants to jump in and fix this up meanwhile, that would be great. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Sat, May 7, 2016 at 5:34 AM, Robert Haas <robertmhaas@gmail.com> wrote: > On Mon, May 2, 2016 at 8:25 PM, Andres Freund <andres@anarazel.de> wrote: >> + * heap_tuple_needs_eventual_freeze >> + * >> + * Check to see whether any of the XID fields of a tuple (xmin, xmax, xvac) >> + * will eventually require freezing. Similar to heap_tuple_needs_freeze, >> + * but there's no cutoff, since we're trying to figure out whether freezing >> + * will ever be needed, not whether it's needed now. >> + */ >> +bool >> +heap_tuple_needs_eventual_freeze(HeapTupleHeader tuple) >> >> Wouldn't redefining this to heap_tuple_is_frozen() and then inverting the >> checks be easier to understand? > > I thought it much safer to keep this as close to a copy of > heap_tuple_needs_freeze() as possible. Copying a function and > inverting all of the return values is much more likely to introduce > bugs, IME. I agree. >> + /* >> + * If xmax is a valid xact or multixact, this tuple is also not frozen. >> + */ >> + if (tuple->t_infomask & HEAP_XMAX_IS_MULTI) >> + { >> + MultiXactId multi; >> + >> + multi = HeapTupleHeaderGetRawXmax(tuple); >> + if (MultiXactIdIsValid(multi)) >> + return true; >> + } >> >> Hm. What's the test inside the if() for? There shouldn't be any case >> where xmax is invalid if HEAP_XMAX_IS_MULTI is set. Now there's a >> check like that outside of this commit, but it seems strange to me >> (Alvaro, perhaps you could comment on this?). > > Here again I was copying existing code, with appropriate simplifications. > >> + * >> + * Clearing both visibility map bits is not separately WAL-logged. The callers >> * must make sure that whenever a bit is cleared, the bit is cleared on WAL >> * replay of the updating operation as well. >> >> I think including "both" here makes things less clear, because it >> differentiates clearing one bit from clearing both. There's no practical >> differentce atm, but still. > > I agree. Fixed. >> * >> * VACUUM will normally skip pages for which the visibility map bit is set; >> * such pages can't contain any dead tuples and therefore don't need vacuuming. >> - * The visibility map is not used for anti-wraparound vacuums, because >> - * an anti-wraparound vacuum needs to freeze tuples and observe the latest xid >> - * present in the table, even on pages that don't have any dead tuples. >> * >> >> I think the remaining sentence isn't entirely accurate, there's now more >> than one bit, and they're different with regard to scan_all/!scan_all >> vacuums (or will be - maybe this updated further in a later commit? But >> if so, that sentence shouldn't yet be removed...). > > We can adjust the language, but I don't really see a big problem here. This comment is not incorporated in this patch so far. >> -/* Number of heap blocks we can represent in one byte. */ >> -#define HEAPBLOCKS_PER_BYTE 8 >> - >> Hm, why was this moved to the header? Sounds like something the outside >> shouldn't care about. > > Oh... yeah. Let's undo that. Fixed. >> #define HEAPBLK_TO_MAPBIT(x) (((x) % HEAPBLOCKS_PER_BYTE) * BITS_PER_HEAPBLOCK) >> >> Hm. This isn't really a mapping to an individual bit anymore - but I >> don't really have a better name in mind. Maybe TO_OFFSET? > > Well, it sorta is... but we could change it, I suppose. > >> +static const uint8 number_of_ones_for_visible[256] = { >> ... >> +}; >> +static const uint8 number_of_ones_for_frozen[256] = { >> ... >> }; >> >> Did somebody verify the new contents are correct? > > I admit that I didn't.
It seemed like an unlikely place for a goof, > but I guess we should verify. >> /* >> - * visibilitymap_clear - clear a bit in visibility map >> + * visibilitymap_clear - clear all bits in visibility map >> * >> >> This seems rather easy to misunderstand, as this really only clears all >> the bits for one page, not actually all the bits. > > We could change "in" to "for one page in the". Fixed. >> * the bit for heapBlk, or InvalidBuffer. The caller is responsible for >> - * releasing *buf after it's done testing and setting bits. >> + * releasing *buf after it's done testing and setting bits, and must pass flags >> + * for which it needs to check the value in visibility map. >> * >> * NOTE: This function is typically called without a lock on the heap page, >> * so somebody else could change the bit just after we look at it. In fact, >> @@ -327,17 +351,16 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf, >> >> I'm not seeing what flags the above comment change is referring to? > > Ugh. I think that's leftover cruft from an earlier patch version that > should have been excised from what got committed. Fixed. >> /* >> - * A single-bit read is atomic. There could be memory-ordering effects >> + * A single byte read is atomic. There could be memory-ordering effects >> * here, but for performance reasons we make it the caller's job to worry >> * about that. >> */ >> - result = (map[mapByte] & (1 << mapBit)) ? true : false; >> - >> - return result; >> + return ((map[mapByte] >> mapBit) & VISIBILITYMAP_VALID_BITS); >> } >> >> Not a new issue, and *very* likely to be irrelevant in practice (given >> the value is only referenced once): But there's really no guarantee >> map[mapByte] is only read once here. > > Meh. But we can fix if you want to. Fixed. >> -BlockNumber >> -visibilitymap_count(Relation rel) >> +void >> +visibilitymap_count(Relation rel, BlockNumber *all_visible, BlockNumber *all_frozen) >> >> Not really a new issue again: The parameter types (previously return >> type) to this function seem wrong to me. > > Not this patch's job to tinker. This comment is not incorporated in this patch yet. >> @@ -1934,5 +1992,14 @@ heap_page_is_all_visible(Relation rel, Buffer buf, TransactionId *visibility_cut >> } >> + /* >> + * We don't bother clearing *all_frozen when the page is discovered not >> + * to be all-visible, so do that now if necessary. The page might fail >> + * to be all-frozen for other reasons anyway, but if it's not all-visible, >> + * then it definitely isn't all-frozen. >> + */ >> + if (!all_visible) >> + *all_frozen = false; >> + >> >> Why don't we just set *all_frozen to false when appropriate? It'd be >> just as many lines and probably easier to understand? > > I thought that looked really easy to mess up, either now or down the > road. This way seemed more solid to me. That's a judgement call, of > course. To make it easier to understand, I changed it so. >> + /* >> + * If the page is marked as all-visible but not all-frozen, we should >> + * so mark it. Note that all_frozen is only valid if all_visible is >> + * true, so we must check both. >> + */ >> >> This kinda seems to imply that all-visible implies all_frozen. Also, why >> has that block been added to the end of the if/else if chain? Seems like >> it belongs below the (all_visible && !all_visible_according_to_vm) block. > > We can adjust the comment a bit to make it more clear, if you like, > but I doubt it's going to cause serious misunderstanding.
As for the > placement, the reason I put it at the end is because I figured that we > did not want to mark it all-frozen if any of the "oh crap, emit a > warning" cases applied. > Fixed comment. I think that we should care about the all-visible problem first, and then the all-frozen problem. So this patch doesn't change the placement. The attached patch fixes only the above comments; the others are being addressed now. -- Regards, -- Masahiko Sawada
On Sat, May 7, 2016 at 5:40 AM, Robert Haas <robertmhaas@gmail.com> wrote: > On Wed, May 4, 2016 at 8:08 PM, Andres Freund <andres@anarazel.de> wrote: >> On 2016-05-02 14:48:18 -0700, Andres Freund wrote: >>> 77a1d1e Department of second thoughts: remove PD_ALL_FROZEN. >> >> Nothing to say here. >> >>> fd31cd2 Don't vacuum all-frozen pages. >> >> Hm. I do wonder if it's going to bite us that we don't have a way to >> actually force vacuuming of the whole table (besides manually rm'ing the >> VM). I've more than once seen VACUUM used to try to do some integrity >> checking of the database. How are we actually going to test that the >> feature works correctly? They'd have to write checks ontop of >> pg_visibility to see whether things are borked. > > Let's add VACUUM (FORCE) or something like that. > >> /* >> * Compute whether we actually scanned the whole relation. If we did, we >> * can adjust relfrozenxid and relminmxid. >> * >> * NB: We need to check this before truncating the relation, because that >> * will change ->rel_pages. >> */ >> >> Comment is out-of-date now. > > OK. Fixed. >> - if (blkno == next_not_all_visible_block) >> + if (blkno == next_unskippable_block) >> { >> - /* Time to advance next_not_all_visible_block */ >> - for (next_not_all_visible_block++; >> - next_not_all_visible_block < nblocks; >> - next_not_all_visible_block++) >> + /* Time to advance next_unskippable_block */ >> + for (next_unskippable_block++; >> + next_unskippable_block < nblocks; >> + next_unskippable_block++) >> >> Hm. So we continue with the course of re-processing pages, even if >> they're marked all-frozen. For all-visible there at least can be a >> benefit by freezing earlier, but for all-frozen pages there's really no >> point. I don't really buy the arguments for the skipping logic. But >> even disregarding that, maybe we should skip processing a block if it's >> all-frozen (without preventing the page from being read?); as there's no >> possible benefit? Acquring the exclusive/content lock and stuff is far >> from free. > > I wanted to tinker with this logic as little as possible in the > interest of ending up with something that worked. I would not have > written it this way. > >> Not really related to this patch, but the FORCE_CHECK_PAGE is rather >> ugly. > > +1. >> + /* >> + * The current block is potentially skippable; if we've seen a >> + * long enough run of skippable blocks to justify skipping it, and >> + * we're not forced to check it, then go ahead and skip. >> + * Otherwise, the page must be at least all-visible if not >> + * all-frozen, so we can set all_visible_according_to_vm = true. >> + */ >> + if (skipping_blocks && !FORCE_CHECK_PAGE()) >> + { >> + /* >> + * Tricky, tricky. If this is in aggressive vacuum, the page >> + * must have been all-frozen at the time we checked whether it >> + * was skippable, but it might not be any more. We must be >> + * careful to count it as a skipped all-frozen page in that >> + * case, or else we'll think we can't update relfrozenxid and >> + * relminmxid. If it's not an aggressive vacuum, we don't >> + * know whether it was all-frozen, so we have to recheck; but >> + * in this case an approximate answer is OK. >> + */ >> + if (aggressive || VM_ALL_FROZEN(onerel, blkno, &vmbuffer)) >> + vacrelstats->frozenskipped_pages++; >> continue; >> + } >> >> Hm. This indeed seems a bit tricky. Not sure how to make it easier >> though without just ripping out the SKIP_PAGES_THRESHOLD stuff. > > Yep, I had the same problem. >> Hm. 
This also doubles the number of VM accesses. While I guess that's >> not noticeable most of the time, it's still not nice; especially when a >> large relation is entirely frozen, because it'll mean we'll sequentially >> go through the visibilitymap twice. > > Compared to what we're saving, that's obviously a trivial cost. > That's not to say that we might not want to improve it, but it's > hardly a disaster. > > In short: wah, wah, wah. > Attached patch optimises skipping pages logic so that blkno can jump to next_unskippable_block directly while counting the number of all_visible and all_frozen pages. So we can avoid double checking visibility map. Regards, -- Masahiko Sawada
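To sketch the shape of that change (a sketch only, under assumed names, not the attached patch itself): the lookahead that establishes next_unskippable_block can tally the all-frozen pages in the run as it scans the map, so the main loop can then consume the whole run in one jump:

    /*
     * frozen_in_run is a hypothetical counter accumulated while scanning
     * ahead for next_unskippable_block, so the map is read only once.
     */
    if (skipping_blocks && blkno < next_unskippable_block)
    {
        vacrelstats->frozenskipped_pages += frozen_in_run;
        blkno = next_unskippable_block;     /* jump over the whole run */
        continue;
    }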
On Wed, Jun 1, 2016 at 3:50 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote: > The attached patch fixes only the above comments; the others are being addressed now. Committed. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Thu, Jun 2, 2016 at 11:24 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote: > Attached patch optimises skipping pages logic so that blkno can jump to > next_unskippable_block directly while counting the number of all_visible > and all_frozen pages. So we can avoid double checking visibility map. I think this is 9.7 material. This patch has already won the "scariest patch" tournament. Changing the logic more than necessary at this late date seems like it just increases the scariness. I think this is an opportunity for further optimization, not a defect. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Fri, Jun 3, 2016 at 11:03 PM, Robert Haas <robertmhaas@gmail.com> wrote: > On Thu, Jun 2, 2016 at 11:24 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote: >> Attached patch optimises skipping pages logic so that blkno can jump to >> next_unskippable_block directly while counting the number of all_visible >> and all_frozen pages. So we can avoid double checking visibility map. > > I think this is 9.7 material. This patch has already won the > "scariest patch" tournament. Changing the logic more than necessary > at this late date seems like it just increases the scariness. I think > this is an opportunity for further optimization, not a defect. > I agree with you. I'll submit this as an improvement for 9.7. That patch also incorporates the following review comment. We can push at least this fix. >> /* >> * Compute whether we actually scanned the whole relation. If we did, we >> * can adjust relfrozenxid and relminmxid. >> * >> * NB: We need to check this before truncating the relation, because that >> * will change ->rel_pages. >> */ >> >> Comment is out-of-date now. I'm addressing the review comments on commit 7087166, and will post the patch. And the testing feature for the freeze map is still under discussion. Regards, -- Masahiko Sawada
On Fri, Jun 3, 2016 at 10:49 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote: > That patch also incorporates the following review comment. > We can push at least this fix. Can you submit that part as a separate patch? > I'm addressing the review comments on commit 7087166, and will post the patch. When? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Sat, Jun 4, 2016 at 12:08 AM, Robert Haas <robertmhaas@gmail.com> wrote: > On Fri, Jun 3, 2016 at 10:49 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote: >> That patch also incorporates the following review comment. >> We can push at least this fix. > > Can you submit that part as a separate patch? Attached. >> I'm addressing the review comments on commit 7087166, and will post the patch. > > When? > On Saturday. Regards, -- Masahiko Sawada
On Fri, Jun 3, 2016 at 11:21 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote: >> Can you submit that part as a separate patch? > > Attached. Thanks, committed. >>> I'm addressing the review comments on commit 7087166, and will post the patch. >> >> When? > > On Saturday. Great. Will that address everything for this open item, then? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Sat, May 7, 2016 at 5:42 AM, Robert Haas <robertmhaas@gmail.com> wrote: > On Thu, May 5, 2016 at 2:20 PM, Andres Freund <andres@anarazel.de> wrote: >> On 2016-05-02 14:48:18 -0700, Andres Freund wrote: >>> 7087166 pg_upgrade: Convert old visibility map format to new format. >> >> +const char * >> +rewriteVisibilityMap(const char *fromfile, const char *tofile, bool force) >> ... >> >> + while ((bytesRead = read(src_fd, buffer, BLCKSZ)) == BLCKSZ) >> + { >> .. >> >> Uh, shouldn't we actually fail if we read incompletely? Rather than >> silently ignoring the problem? Ok, this causes no corruption, but it >> indicates that something went significantly wrong. > > Sure, that's reasonable. > Fixed. >> + char new_vmbuf[BLCKSZ]; >> + char *new_cur = new_vmbuf; >> + bool empty = true; >> + bool old_lastpart; >> + >> + /* Copy page header in advance */ >> + memcpy(new_vmbuf, &pageheader, SizeOfPageHeaderData); >> >> Shouldn't we zero out new_vmbuf? Afaics we're not necessarily zeroing it >> with old_lastpart && !empty, right? > > Oh, dear. That seems like a possible data corruption bug. Maybe we'd > better fix that right away (although I don't actually have time before > the wrap). Since the force is always set true, I removed the force from argument of copyFile() and rewriteVisibilityMap(). And destination file is always opened with O_RDWR, O_CREAT, O_TRUNC flags . >> + if ((dst_fd = open(tofile, O_RDWR | O_CREAT | (force ? 0 : O_EXCL), S_IRUSR | S_IWUSR)) < 0) >> + { >> + close(src_fd); >> + return getErrorText(); >> + } >> >> I know you guys copied this, but what's the force thing about? >> Especially as it's always set to true by the callers (i.e. what is the >> parameter even about?)? Wouldn't we at least have to specify O_TRUNC in >> the force case? > > I just work here. > >> + old_cur += BITS_PER_HEAPBLOCK_OLD; >> + new_cur += BITS_PER_HEAPBLOCK; >> >> I'm not sure I'm understanding the point of the BITS_PER_HEAPBLOCK_OLD >> stuff - as long as it's hardcoded into rewriteVisibilityMap() we'll not >> be able to have differing ones anyway, should we decide to add a third >> bit? > > I think that's just a matter of style. So this comment is not incorporated. Attached patch, please review it. Regards, -- Masahiko Sawada
On Fri, Jun 3, 2016 at 10:25 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote: >>> + char new_vmbuf[BLCKSZ]; >>> + char *new_cur = new_vmbuf; >>> + bool empty = true; >>> + bool old_lastpart; >>> + >>> + /* Copy page header in advance */ >>> + memcpy(new_vmbuf, &pageheader, SizeOfPageHeaderData); >>> >>> Shouldn't we zero out new_vmbuf? Afaics we're not necessarily zeroing it >>> with old_lastpart && !empty, right? >> >> Oh, dear. That seems like a possible data corruption bug. Maybe we'd >> better fix that right away (although I don't actually have time before >> the wrap). Actually, on second thought, I'm not seeing the bug here. It seems to me that the loop commented this way: /* Process old page bytes one by one, and turn it into new page. */ ...should always write to every byte in new_vmbuf, because we process exactly half the bytes in the old block at a time, and so that's going to generate exactly one full page of new bytes. Am I missing something? > Since the force is always set true, I removed the force from argument > of copyFile() and rewriteVisibilityMap(). > And destination file is always opened with O_RDWR, O_CREAT, O_TRUNC flags . I'm not happy with this. I think we should always open with O_EXCL, because the new file is not expected to exist and if it does, something's probably broken. I think we should default to the safe behavior (which is failing) rather than the unsafe behavior (which is clobbering data). (Status update for Noah: I expect Masahiko Sawada will respond quickly, but if not I'll give some kind of update by Monday COB anyhow.) -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
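Looking at the conversion arithmetic in isolation may make that easier to see. A standalone sketch (not the pg_upgrade code itself): each old-format byte carries one bit apiece for eight heap blocks and expands to exactly two new-format bytes of four two-bit entries each, so half an old page of data always yields one full new page, and every byte of the output gets assigned:

    #include <stdio.h>

    #define ALL_VISIBLE 0x01        /* low bit of each 2-bit entry */

    /* Expand one old-format VM byte into two new-format bytes. */
    static void
    expand_byte(unsigned char old, unsigned char out[2])
    {
        int         i;

        out[0] = out[1] = 0;
        for (i = 0; i < 8; i++)
        {
            if (old & (1 << i))
                out[i / 4] |= ALL_VISIBLE << ((i % 4) * 2);
            /* the conversion never sets the all-frozen bit */
        }
    }

    int
    main(void)
    {
        unsigned char out[2];

        expand_byte(0xb1, out);     /* arbitrary example input */
        printf("0xb1 -> 0x%02x 0x%02x\n", out[0], out[1]);
        return 0;
    }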
On Sat, Jun 4, 2016 at 12:41 PM, Robert Haas <robertmhaas@gmail.com> wrote: > On Fri, Jun 3, 2016 at 10:25 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote: >>>> + char new_vmbuf[BLCKSZ]; >>>> + char *new_cur = new_vmbuf; >>>> + bool empty = true; >>>> + bool old_lastpart; >>>> + >>>> + /* Copy page header in advance */ >>>> + memcpy(new_vmbuf, &pageheader, SizeOfPageHeaderData); >>>> >>>> Shouldn't we zero out new_vmbuf? Afaics we're not necessarily zeroing it >>>> with old_lastpart && !empty, right? >>> >>> Oh, dear. That seems like a possible data corruption bug. Maybe we'd >>> better fix that right away (although I don't actually have time before >>> the wrap). > > Actually, on second thought, I'm not seeing the bug here. It seems to > me that the loop commented this way: > > /* Process old page bytes one by one, and turn it into new page. */ > > ...should always write to every byte in new_vmbuf, because we process > exactly half the bytes in the old block at a time, and so that's going > to generate exactly one full page of new bytes. Am I missing > something? Yeah, you're right. rewriteVisibilityMap() always writes exactly the whole of new_vmbuf. > >> Since force is always set to true, I removed the force argument from >> copyFile() and rewriteVisibilityMap(), and the destination file is now >> always opened with the O_RDWR, O_CREAT, and O_TRUNC flags. > > I'm not happy with this. I think we should always open with O_EXCL, > because the new file is not expected to exist and if it does, > something's probably broken. I think we should default to the safe > behavior (which is failing) rather than the unsafe behavior (which is > clobbering data). I've specified O_EXCL instead of O_TRUNC. Attached is the updated patch. Regards, -- Masahiko Sawada
Attachment
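A sketch of the safer open call agreed on here (assumed flag set, not a quote of the final patch): with O_EXCL, creation fails if the destination already exists, so a leftover file aborts the conversion instead of being clobbered.

    #include <fcntl.h>
    #include <sys/stat.h>

    /* Create tofile; fail with EEXIST rather than truncating leftovers. */
    static int
    open_new_vm_file(const char *tofile)
    {
        return open(tofile, O_RDWR | O_CREAT | O_EXCL, S_IRUSR | S_IWUSR);
    }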
On Sat, Jun 4, 2016 at 12:59 AM, Robert Haas <robertmhaas@gmail.com> wrote: > On Fri, Jun 3, 2016 at 11:21 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote: >>> Can you submit that part as a separate patch? >> >> Attached. > > Thanks, committed. > >>>> I'm addressing the review comment of commit 7087166, and will post the patch. >>> >>> When? >> >> On Saturday. > > Great. Will that address everything for this open item, then? > I attached the patch for commit 7087166 in another mail. I think that only the test tool for the visibility map remains, and it is under discussion. Even if we have a verification tool or function for the visibility map, we cannot repair its contents if they turn out to be wrong. So I think we should have a way to re-generate the visibility map. For this purpose, doing vacuum while ignoring the visibility map via a new option or new function is one idea. But IMHO, it's not a good idea to allow a function to do vacuum, and expanding the VACUUM syntax might be somewhat overkill. So another idea is to have a GUC parameter, for example vacuum_even_frozen_page. If this parameter is set to true (false by default), we vacuum the whole table forcibly and re-generate the visibility map. The advantage of this idea is that we don't need to expand the VACUUM syntax and can relatively easily remove this parameter if it's no longer necessary. Thoughts? Regards, -- Masahiko Sawada
On Sat, Jun 4, 2016 at 1:46 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote: > On Sat, Jun 4, 2016 at 12:59 AM, Robert Haas <robertmhaas@gmail.com> wrote: >> On Fri, Jun 3, 2016 at 11:21 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote: >>>> Can you submit that part as a separate patch? >>> >>> Attached. >> >> Thanks, committed. >> >>>>> I'm addressing the review comment of commit 7087166, and will post the patch. >>>> >>>> When? >>> >>> On Saturday. >> >> Great. Will that address everything for this open item, then? >> > > I attached the patch for commit 7087166 in another mail. > I think that only the test tool for the visibility map remains, and it is > under discussion. > Even if we have a verification tool or function for the visibility map, we > cannot repair its contents if they turn out to be wrong. > So I think we should have a way to re-generate the visibility map. > For this purpose, doing vacuum while ignoring the visibility map via a new > option or new function is one idea. > But IMHO, it's not a good idea to allow a function to do vacuum, and > expanding the VACUUM syntax might be somewhat overkill. > > So another idea is to have a GUC parameter, for example vacuum_even_frozen_page. > If this parameter is set to true (false by default), we vacuum the whole > table forcibly and re-generate the visibility map. > The advantage of this idea is that we don't need to expand the VACUUM > syntax and can relatively easily remove this parameter if it's no longer > necessary. > Attached is a sample patch that controls whole-table vacuum via a new GUC parameter. Regards, -- Masahiko Sawada
Attachment
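As a rough illustration of the proposal (hypothetical names throughout; this GUC was never committed), the page-skip decision in the heap scan might consult the parameter like this:

    #include <stdbool.h>

    /* Hypothetical GUC proposed above, false by default. */
    bool        vacuum_even_frozen_page = false;

    /*
     * Sketch of the skip test: an aggressive vacuum may normally skip a
     * page whose all-frozen bit is set in the VM; with the GUC enabled,
     * every page is visited, re-deriving its visibility map bits.
     */
    static bool
    skip_all_frozen_page(bool all_frozen_according_to_vm)
    {
        return all_frozen_according_to_vm && !vacuum_even_frozen_page;
    }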
On Mon, Jun 6, 2016 at 5:44 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote: > On Sat, Jun 4, 2016 at 1:46 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote: >> On Sat, Jun 4, 2016 at 12:59 AM, Robert Haas <robertmhaas@gmail.com> wrote: >>> On Fri, Jun 3, 2016 at 11:21 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote: >>>>> Can you submit that part as a separate patch? >>>> >>>> Attached. >>> >>> Thanks, committed. >>> >>>>>> I'm addressing the review comment of commit 7087166, and will post the patch. >>>>> >>>>> When? >>>> >>>> On Saturday. >>> >>> Great. Will that address everything for this open item, then? >>> >> >> I attached the patch for commit 7087166 in another mail. >> I think that only the test tool for the visibility map remains, and it is >> under discussion. >> Even if we have a verification tool or function for the visibility map, we >> cannot repair its contents if they turn out to be wrong. >> So I think we should have a way to re-generate the visibility map. >> For this purpose, doing vacuum while ignoring the visibility map via a new >> option or new function is one idea. >> But IMHO, it's not a good idea to allow a function to do vacuum, and >> expanding the VACUUM syntax might be somewhat overkill. >> >> So another idea is to have a GUC parameter, for example vacuum_even_frozen_page. >> If this parameter is set to true (false by default), we vacuum the whole >> table forcibly and re-generate the visibility map. >> The advantage of this idea is that we don't need to expand the VACUUM >> syntax and can relatively easily remove this parameter if it's no longer >> necessary. >> > > Attached is a sample patch that controls whole-table vacuum via a new GUC parameter. Don't we want a reloption for that? Just wondering... -- Michael
On Mon, Jun 6, 2016 at 5:11 AM, Michael Paquier <michael.paquier@gmail.com> wrote: >> Attached is a sample patch that controls whole-table vacuum via a new GUC parameter. > > Don't we want a reloption for that? Just wondering... Why? Just for consistency? I think the bigger question here is whether we need to do anything at all. It's true that, without some new option, we'll lose the ability to forcibly vacuum every page in the relation, even if all-frozen. But there's not much use case for that in the first place. It will be potentially helpful if it turns out that we have a bug that sets the all-frozen bit on pages that are not, in fact, all-frozen. Otherwise, what's the use? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Mon, Jun 6, 2016 at 6:34 PM, Robert Haas <robertmhaas@gmail.com> wrote: > On Mon, Jun 6, 2016 at 5:11 AM, Michael Paquier > <michael.paquier@gmail.com> wrote: >>> Attached is a sample patch that controls whole-table vacuum via a new GUC parameter. >> >> Don't we want a reloption for that? Just wondering... > > Why? Just for consistency? I think the bigger question here is > whether we need to do anything at all. It's true that, without some > new option, we'll lose the ability to forcibly vacuum every page in > the relation, even if all-frozen. But there's not much use case for > that in the first place. It will be potentially helpful if it turns > out that we have a bug that sets the all-frozen bit on pages that are > not, in fact, all-frozen. Otherwise, what's the use? > I cannot agree with using this parameter as a reloption. We would set it to true only when a serious bug is discovered and we want to re-generate the visibility maps of specific tables. I thought that control via a GUC parameter would be more convenient than adding a new option. Regards, -- Masahiko Sawada
On Sat, Jun 4, 2016 at 12:18 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote: > Attached updated patch. The error-checking enhancements here look good to me, except that you forgot to initialize totalBytesRead. I've committed those changes with a fix for that problem and will look at the rest of this separately. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Masahiko Sawada <sawada.mshk@gmail.com> writes: > On Sat, Jun 4, 2016 at 1:46 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote: >> So another idea is to have a GUC parameter, for example vacuum_even_frozen_page. >> If this parameter is set to true (false by default), we vacuum the whole >> table forcibly and re-generate the visibility map. >> The advantage of this idea is that we don't need to expand the VACUUM >> syntax and can relatively easily remove this parameter if it's no longer >> necessary. > Attached is a sample patch that controls whole-table vacuum via a new GUC parameter. I find this approach fairly ugly ... it's randomly inconsistent with other VACUUM parameters for no very defensible reason. Taking out GUCs is not easier than taking out statement parameters; you risk breaking applications either way. regards, tom lane
On Mon, Jun 6, 2016 at 7:46 AM, Robert Haas <robertmhaas@gmail.com> wrote: > On Sat, Jun 4, 2016 at 12:18 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote: >> Attached updated patch. > > The error-checking enhancements here look good to me, except that you > forgot to initialize totalBytesRead. I've committed those changes > with a fix for that problem and will look at the rest of this > separately. Committed that now, too. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Mon, Jun 6, 2016 at 9:53 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Masahiko Sawada <sawada.mshk@gmail.com> writes: >> On Sat, Jun 4, 2016 at 1:46 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote: >>> So another idea is to have a GUC parameter, for example vacuum_even_frozen_page. >>> If this parameter is set to true (false by default), we vacuum the whole >>> table forcibly and re-generate the visibility map. >>> The advantage of this idea is that we don't need to expand the VACUUM >>> syntax and can relatively easily remove this parameter if it's no longer >>> necessary. > >> Attached is a sample patch that controls whole-table vacuum via a new GUC parameter. > > I find this approach fairly ugly ... it's randomly inconsistent with other > VACUUM parameters for no very defensible reason. Just to be sure I understand, in what way is it inconsistent? > Taking out GUCs is not > easier than taking out statement parameters; you risk breaking > applications either way. Agreed, but that doesn't really answer the question of which one we should have, if either. My gut feeling on this is to either do nothing or add a VACUUM option (not a GUC, not a reloption) called even_frozen_pages, default false. What is your opinion? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes: > On Mon, Jun 6, 2016 at 9:53 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> Taking out GUCs is not >> easier than taking out statement parameters; you risk breaking >> applications either way. > Agreed, but that doesn't really answer the question of which one we > should have, if either. My gut feeling on this is to either do > nothing or add a VACUUM option (not a GUC, not a reloption) called > even_frozen_pages, default false. What is your opinion? That's about where I stand, with some preference for "do nothing". I'm not convinced we need this. regards, tom lane
On Fri, Jun 3, 2016 at 11:41 PM, Robert Haas <robertmhaas@gmail.com> wrote: > (Status update for Noah: I expect Masahiko Sawada will respond > quickly, but if not I'll give some kind of update by Monday COB > anyhow.) I believe this open item is now closed, unless Andres has more comments or wishes to discuss any point further, with the exception that we still need to decide whether to add VACUUM (even_frozen_pages) or some variant of that. I have added a new open item for that issue and marked this one as resolved. My intended strategy as the presumptive owner of the new items is to do nothing unless more of a consensus emerges than we have presently. We do not seem to have clear agreement on whether to add the new option; whether to make it a GUC, a reloption, a VACUUM syntax option, or some combination of those things; and whether it should blow up the existing VM and rebuild it (as proposed by Sawada-san) or just force frozen pages to be scanned in the hope that something good will happen (as proposed by Andres). In the absence of consensus, doing nothing is a reasonable choice here. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Robert Haas wrote: > My gut feeling on this is to either do nothing or add a VACUUM option > (not a GUC, not a reloption) called even_frozen_pages, default false. > What is your opinion? +1 for that approach -- I thought that was already agreed weeks ago and the only question was what to name that option. even_frozen_pages sounds better than SCANALL, SCAN_ALL, FREEZE, FORCE (the other options I saw proposed in that subthread), so +1 for that naming too. I don't like doing nothing; that means that when we discover a bug we'll have to tell users to rm a file whose name requires a complicated catalog query to find out, so -1 for that. -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Mon, Jun 6, 2016 at 10:18 AM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote: > Robert Haas wrote: >> My gut feeling on this is to either do nothing or add a VACUUM option >> (not a GUC, not a reloption) called even_frozen_pages, default false. >> What is your opinion? > > +1 for that approach -- I thought that was already agreed weeks ago and > the only question was what to name that option. even_frozen_pages > sounds better than SCANALL, SCAN_ALL, FREEZE, FORCE (the other > options I saw proposed in that subthread), so +1 for that naming > too. > > I don't like doing nothing; that means that when we discover a bug we'll > have to tell users to rm a file whose name requires a complicated > catalog query to find out, so -1 for that. So... I agree that it is definitely not good if we have to tell users to rm a file, but I am not quite sure how this new option would prevent us from having to say that? Here are some potential kinds of bugs we might have:

1. Sometimes, the all-frozen bit doesn't get set when it should.

2. Sometimes, the all-frozen bit gets set when it shouldn't.

3. Some combination of (1) and (2), so that the VM fork can't be trusted in either direction.

If (1) happens, removing the VM fork is not a good idea; what people will want to do is re-run a VACUUM FREEZE. If (2) or (3) happens, removing the VM fork might be a good idea, but it's not really clear that VACUUM (even_frozen_pages) will help much. For one thing, if there are actually unfrozen tuples on those pages and the clog pages which they reference are already gone or recycled, rerunning VACUUM on the table in any form might permanently lose data, or maybe it will just fail. If because of the nature of the bug you somehow know that case doesn't pertain, then I suppose the bug is that the tuple-level and page-level state is out of sync. VACUUM (even_frozen_pages) probably won't help with that much either, because VACUUM never clears the all-frozen bit without also clearing the all-visible bit, and then only if the page contains dead tuples, which in this case it probably doesn't. I'm intuitively sympathetic to the idea that we should have an option for this, but I can't figure out in what case we'd actually tell anyone to use it. It would be useful for the kinds of bugs listed above to have VACUUM (rebuild_vm) to blow away the VM fork and rebuild it, but that's different semantics than what we proposed for VACUUM (even_frozen_pages). And I'd be sort of inclined to handle that case by providing some other way to remove VM forks (like a new function in the pg_visibility contrib module, maybe?) and then just tell people to run regular VACUUM afterwards, rather than putting the actual VM fork removal into VACUUM. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes: > I'm intuitively sympathetic to the idea that we should have an option > for this, but I can't figure out in what case we'd actually tell > anyone to use it. It would be useful for the kinds of bugs listed > above to have VACUUM (rebuild_vm) to blow away the VM fork and rebuild > it, but that's different semantics than what we proposed for VACUUM > (even_frozen_pages). And I'd be sort of inclined to handle that case > by providing some other way to remove VM forks (like a new function in > the pg_visibility contrib module, maybe?) and then just tell people > to run regular VACUUM afterwards, rather than putting the actual VM > fork removal into VACUUM. There's a lot to be said for that approach. If we do it, I'd be a bit inclined to offer an option to blow away the FSM as well. regards, tom lane
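To make the shape of that idea concrete, here is a hedged sketch of such a contrib helper (hypothetical function name; it assumes the 9.6-era visibilitymap_truncate() API and glosses over permission checks and error handling). It throws away the relation's VM fork contents so that a subsequent plain VACUUM has to rebuild them from scratch:

    #include "postgres.h"

    #include "access/heapam.h"
    #include "access/visibilitymap.h"
    #include "fmgr.h"

    PG_MODULE_MAGIC;

    PG_FUNCTION_INFO_V1(pg_truncate_visibility_map);

    /* Discard a relation's visibility map; the next VACUUM rebuilds it. */
    Datum
    pg_truncate_visibility_map(PG_FUNCTION_ARGS)
    {
        Oid         relid = PG_GETARG_OID(0);
        Relation    rel = relation_open(relid, AccessExclusiveLock);

        visibilitymap_truncate(rel, 0);     /* keep zero heap blocks' worth */

        relation_close(rel, AccessExclusiveLock);
        PG_RETURN_VOID();
    }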
On 2016-06-06 05:34:32 -0400, Robert Haas wrote: > On Mon, Jun 6, 2016 at 5:11 AM, Michael Paquier > <michael.paquier@gmail.com> wrote: > >> Attached is a sample patch that controls whole-table vacuum via a new GUC parameter. > > > > Don't we want a reloption for that? Just wondering... > > Why? Just for consistency? I think the bigger question here is > whether we need to do anything at all. It's true that, without some > new option, we'll lose the ability to forcibly vacuum every page in > the relation, even if all-frozen. But there's not much use case for > that in the first place. It will be potentially helpful if it turns > out that we have a bug that sets the all-frozen bit on pages that are > not, in fact, all-frozen. Otherwise, what's the use? Except that we right now don't have any realistic way to figure out whether this new feature actually does the right thing. Which makes testing this *considerably* harder than just VACUUM (dwim). I think it's unacceptable to release this feature without a way that'll tell that it so far has/has not corrupted the database. Would that, in a perfect world, be vacuum? No, probably not. But since we're not in a perfect world... Andres
On Mon, Jun 6, 2016 at 11:28 AM, Andres Freund <andres@anarazel.de> wrote: > On 2016-06-06 05:34:32 -0400, Robert Haas wrote: >> On Mon, Jun 6, 2016 at 5:11 AM, Michael Paquier >> <michael.paquier@gmail.com> wrote: >> >> Attached is a sample patch that controls whole-table vacuum via a new GUC parameter. >> > >> > Don't we want a reloption for that? Just wondering... >> >> Why? Just for consistency? I think the bigger question here is >> whether we need to do anything at all. It's true that, without some >> new option, we'll lose the ability to forcibly vacuum every page in >> the relation, even if all-frozen. But there's not much use case for >> that in the first place. It will be potentially helpful if it turns >> out that we have a bug that sets the all-frozen bit on pages that are >> not, in fact, all-frozen. Otherwise, what's the use? > > Except that we right now don't have any realistic way to figure out > whether this new feature actually does the right thing. Which makes > testing this *considerably* harder than just VACUUM (dwim). I think it's > unacceptable to release this feature without a way that'll tell that it > so far has/has not corrupted the database. Would that, in a perfect > world, be vacuum? No, probably not. But since we're not in a perfect world... I just don't see how running VACUUM on the all-frozen pages is going to help. In terms of diagnostic tools, you can get the VM bits and page-level bits using the pg_visibility extension; I wrote it precisely because of concerns like the ones you raise here. If you want to cross-check the page-level bits against the tuple-level bits, you can do that with the pageinspect extension. And if you do those things, you can actually find out whether stuff is broken. Vacuuming the all-frozen pages won't tell you that. It will either do nothing (which doesn't tell you that things are OK) or it will change something (possibly without reporting any message, and possibly making a bad situation worse instead of better). -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 2016-06-06 11:37:25 -0400, Robert Haas wrote: > On Mon, Jun 6, 2016 at 11:28 AM, Andres Freund <andres@anarazel.de> wrote: > > Except that we right now don't have any realistic way to figure out > > whether this new feature actually does the right thing. Which makes > > testing this *considerably* harder than just VACUUM (dwim). I think it's > > unacceptable to release this feature without a way that'll tell that it > > so far has/has not corrupted the database. Would that, in a perfect > > world, be vacuum? No, probably not. But since we're not in a perfect world... > > I just don't see how running VACUUM on the all-frozen pages is going > to help. Because we can tell people in the beta2 announcement or some wiki page "please run VACUUM (scan_all)" and check whether it emits WARNINGs. And if we suspect the freeze map in bug reports, we can just ask reporters to run a VACUUM (scan_all). > In terms of diagnostic tools, you can get the VM bits and > page-level bits using the pg_visibility extension; I wrote it > precisely because of concerns like the ones you raise here. If you > want to cross-check the page-level bits against the tuple-level bits, > you can do that with the pageinspect extension. And if you do those > things, you can actually find out whether stuff is broken. That's WAY out of reach of any "normal users". Adding a vacuum option is doable, writing complex queries is not. > Vacuuming the all-frozen pages won't tell you that. It will either do > nothing (which doesn't tell you that things are OK) or it will change > something (possibly without reporting any message, and possibly making > a bad situation worse instead of better). We found a number of bugs in the equivalent all-visible handling via the vacuum error reporting around it. Greetings, Andres Freund
Robert Haas <robertmhaas@gmail.com> writes: > On Mon, Jun 6, 2016 at 11:28 AM, Andres Freund <andres@anarazel.de> wrote: >> Except that we right now don't have any realistic way to figure out >> whether this new feature actually does the right thing. > I just don't see how running VACUUM on the all-frozen pages is going > to help. Yes. I don't see that any of the proposed features would be very useful for answering the question "is my VM incorrect". Maybe they would fix problems, and maybe not, but in any case you couldn't rely on VACUUM to tell you about a problem. (Even if you've got warning messages in there, they might disappear into the postmaster log during an auto-vacuum. Warning messages in VACUUM are not a good debugging technology.) regards, tom lane
On Mon, Jun 6, 2016 at 11:44 AM, Andres Freund <andres@anarazel.de> wrote: >> In terms of diagnostic tools, you can get the VM bits and >> page-level bits using the pg_visibility extension; I wrote it >> precisely because of concerns like the ones you raise here. If you >> want to cross-check the page-level bits against the tuple-level bits, >> you can do that with the pageinspect extension. And if you do those >> things, you can actually find out whether stuff is broken. > > That's WAY out of reach of any "normal users". Adding a vacuum option > is doable, writing complex queries is not. Why would they have to write the complex query? Wouldn't they just need to run the one we wrote for them? I mean, I'm not 100% dead set against this option you want, but in all honesty, I would never, ever tell anyone to use it. Unleashing VACUUM on possibly-damaged data is just asking it to decide to prune away tuples you don't want gone. I would try very hard to come up with something to give that user that was only going to *read* the possibly-damaged data with as little chance of modifying or erasing it as possible. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
* Robert Haas (robertmhaas@gmail.com) wrote: > On Mon, Jun 6, 2016 at 11:44 AM, Andres Freund <andres@anarazel.de> wrote: > >> In terms of diagnostic tools, you can get the VM bits and > >> page-level bits using the pg_visibility extension; I wrote it > >> precisely because of concerns like the ones you raise here. If you > >> want to cross-check the page-level bits against the tuple-level bits, > >> you can do that with the pageinspect extension. And if you do those > >> things, you can actually find out whether stuff is broken. > > > > That's WAY out of reach of any "normal users". Adding a vacuum option > > is doable, writing complex queries is not. > > Why would they have to write the complex query? Wouldn't they just > need to run the one we wrote for them? > > I mean, I'm not 100% dead set against this option you want, but in all > honesty, I would never, ever tell anyone to use it. Unleashing > VACUUM on possibly-damaged data is just asking it to decide to prune > away tuples you don't want gone. I would try very hard to come up > with something to give that user that was only going to *read* the > possibly-damaged data with as little chance of modifying or erasing it > as possible. I certainly agree with this. We need a read-only utility which checks that the system is in a correct and valid state. There are a few of those which have been built for different pieces, I believe, and we really should have one for the visibility map, but I don't think it makes sense to imply in any way that VACUUM can or should be used for that. Thanks! Stephen
On 2016-06-06 14:24:14 -0400, Stephen Frost wrote: > * Robert Haas (robertmhaas@gmail.com) wrote: > > On Mon, Jun 6, 2016 at 11:44 AM, Andres Freund <andres@anarazel.de> wrote: > > >> In terms of diagnostic tools, you can get the VM bits and > > >> page-level bits using the pg_visibility extension; I wrote it > > >> precisely because of concerns like the ones you raise here. If you > > >> want to cross-check the page-level bits against the tuple-level bits, > > >> you can do that with the pageinspect extension. And if you do those > > >> things, you can actually find out whether stuff is broken. > > > > > > That's WAY out of reach of any "normal users". Adding a vacuum option > > > is doable, writing complex queries is not. > > > > Why would they have to write the complex query? Wouldn't they just > > need to run the one we wrote for them? Then write that query. Verify that that query performs halfway reasonably fast. Document that it should be run against databases after subjecting them to tests. That'd address my concern as well. > > I mean, I'm not 100% dead set against this option you want, but in all > > honesty, I would never, ever tell anyone to use it. Unleashing > > VACUUM on possibly-damaged data is just asking it to decide to prune > > away tuples you don't want gone. I would try very hard to come up > > with something to give that user that was only going to *read* the > > possibly-damaged data with as little chance of modifying or erasing it > > as possible. I'm more concerned about being able to verify that the freeze logic actually does something meaningful, in situations where we'd *NOT* expect any problems. If we're not trusting vacuum in that situation, well ... > I certainly agree with this. > > We need a read-only utility which checks that the system is in a correct > and valid state. There are a few of those which have been built for > different pieces, I believe, and we really should have one for the > visibility map, but I don't think it makes sense to imply in any way > that VACUUM can or should be used for that. Meh. This is vacuum behaviour that *has existed* up to this point. You essentially removed it. Sure, I'm all for adding a verification tool. But that's just pie in the sky at this point. We have a complex, data-loss-threatening feature, which just about nobody can verify at this point. That's crazy. Greetings, Andres Freund
On Mon, Jun 6, 2016 at 2:35 PM, Andres Freund <andres@anarazel.de> wrote: >> > Why would they have to write the complex query? Wouldn't they just >> > need to run the one we wrote for them? > > Then write that query. Verify that that query performs halfway > reasonably fast. Document that it should be run against databases after > subjecting them to tests. That'd address my concern as well. You know, I am starting to lose a teeny bit of patience here. I do appreciate you reviewing this code, very much, and genuinely, and it would be great if more people wanted to review it. But this kind of reads like you think that I'm being a jerk, which I'm trying pretty hard not to be, and like you have the right to assign me arbitrary work, which I think you don't. If you want to have a reasonable conversation about what the options are for making this better, great. If you want me to do some work to help improve things on a patch I committed, that is 100% fair. But I don't know what I did to earn this response which, to me, reads as rather demanding and rather exasperated. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Hi, On 2016-06-06 15:16:10 -0400, Robert Haas wrote: > On Mon, Jun 6, 2016 at 2:35 PM, Andres Freund <andres@anarazel.de> wrote: > >> > Why would they have to write the complex query? Wouldn't they just > >> > need to run the one we wrote for them? > > > > Then write that query. Verify that that query performs halfway > > reasonably fast. Document that it should be run against databases after > > subjecting them to tests. That'd address my concern as well. > > You know, I am starting to lose a teeny bit of patience here. Same here. > I do appreciate you reviewing this code, very much, and genuinely, and > it would be great if more people wanted to review it. > But this kind of reads like you think that I'm being a jerk, which I'm > trying pretty hard not to be I don't think you're a jerk. But I am losing a good bit of my patience here. I posted these issues a month ago, and for a long while the only thing that happened was bikeshedding about the name of something that wasn't even decided to happen yet (obviously said bikeshedding isn't your fault). > and like you have the right to assign me arbitrary work, which I > think you don't. It's not like adding a parameter for this would be a lot of work; there's even a patch out there. I'm getting impatient because I feel the issue of this critical feature not being testable is getting ignored and/or played down. And then sidetracked into a general "let's add a database consistency checker" type discussion. Which we need, but won't get in 9.6. If you say: "I agree with the feature in principle, but I don't want to spend time to review/commit it." - ok, that's fair enough. But at the moment that isn't what I'm reading between the lines. > If you want to have a > reasonable conversation about what the options are for making this > better, great. Yes, I want that. > If you want me to do some work to help improve things on a patch I > committed, that is 100% fair. But I don't know what I did to earn > this response which, to me, reads as rather demanding and rather > exasperated. I don't think it's absurd to make some demands on the committer of an impact-heavy feature, about at least finding a realistic path towards the new feature being realistically testable. This is a scary (but *REALLY IMPORTANT*) patch, and I don't understand why it's ok that we can't push it through a couple of wraparounds under high concurrency and easily verify that the freeze map is in sync with the actual data. And yes, I *am* exasperated that I'm the only one who appears to be scared by the lack of that capability. I think the feature is in a *lot* better shape than multixacts, but it certainly has the potential to do even more damage in ways that'll essentially be unrecoverable. Andres
Andres, all, * Andres Freund (andres@anarazel.de) wrote: > On 2016-06-06 15:16:10 -0400, Robert Haas wrote: > > On Mon, Jun 6, 2016 at 2:35 PM, Andres Freund <andres@anarazel.de> wrote: > > and like you have the right to assign me arbitrary work, which I > > think you don't. > > It's not like adding a parameter for this would be a lot of work; > there's even a patch out there. I'm getting impatient because I feel > the issue of this critical feature not being testable is getting ignored > and/or played down. And then sidetracked into a general "let's add a > database consistency checker" type discussion. Which we need, but won't > get in 9.6. To be clear, I was pointing out that we've had similar types of consistency checkers implemented for other big features (eg: Heikki's work on checking that WAL works) and that it'd be good to have one here also. That could be as simple as a query with the right things installed, or it might be an independent tool, but not having any way to check isn't good. That said, trying to make VACUUM do that doesn't make sense to me either. Perhaps that's not an option due to the lateness of the hour or the lack of manpower behind it, but that doesn't seem to be what has been said so far. > > If you want me to do some work to help improve things on a patch I > > committed, that is 100% fair. But I don't know what I did to earn > > this response which, to me, reads as rather demanding and rather > > exasperated. > > I don't think it's absurd to make some demands on the committer of an > impact-heavy feature, about at least finding a realistic path towards > the new feature being realistically testable. This is a scary (but > *REALLY IMPORTANT*) patch, and I don't understand why it's ok that we > can't push it through a couple of wraparounds under high concurrency and > easily verify that the freeze map is in sync with the actual data. > > And yes, I *am* exasperated that I'm the only one who appears to be > scared by the lack of that capability. I think the feature is in a > *lot* better shape than multixacts, but it certainly has the potential > to do even more damage in ways that'll essentially be unrecoverable. Not having a straightforward way to ensure that it's working properly is certainly concerning to me as well. Thanks! Stephen
On 2016-06-06 16:18:19 -0400, Stephen Frost wrote: > That could be as simple as a query with the right things installed, or > it might be an independent tool, but not having any way to check isn't > good. That said, trying to make VACUUM do that doesn't make sense to me > either. The point is that VACUUM *has* these types of checks, and has had them for many years:

    else if (all_visible_according_to_vm && !PageIsAllVisible(page) &&
             VM_ALL_VISIBLE(onerel, blkno, &vmbuffer))
    {
        elog(WARNING, "page is not marked all-visible but visibility map bit is set in relation \"%s\" page %u",
             relname, blkno);
        visibilitymap_clear(onerel, blkno, vmbuffer);
    }
    ...
    else if (PageIsAllVisible(page) && has_dead_tuples)
    {
        elog(WARNING, "page containing dead tuples is marked as all-visible in relation \"%s\" page %u",
             relname, blkno);
        PageClearAllVisible(page);
        MarkBufferDirty(buf);
        visibilitymap_clear(onerel, blkno, vmbuffer);
    }

The point is that, after the introduction of the freeze bit, there's no way to reach them anymore (and they're missing a useful extension of these warnings, but ...); these warnings have caught bugs. I wouldn't advocate for the vacuum option otherwise. Greetings, Andres Freund
On Mon, Jun 6, 2016 at 4:06 PM, Andres Freund <andres@anarazel.de> wrote: >> I do appreciate you reviewing this code, very much, and genuinely, and >> it would be great if more people wanted to review it. > >> But this kind of reads like you think that I'm being a jerk, which I'm >> trying pretty hard not to be > > I don't think you're a jerk. But I am losing a good bit of my patience > here. I posted these issues a month ago, and for a long while the > only thing that happened was bikeshedding about the name of something > that wasn't even decided to happen yet (obviously said bikeshedding > isn't your fault). No, the bikeshedding is not my fault. As for the timing, you posted your first comments exactly a week before beta1, when I was still busy addressing issues that were reported before you reported yours, and I did not think it was realistic to get them addressed in the time available. If you'd sent them two weeks sooner, I would probably have done so. Now, it's been four weeks since beta1 wrapped, one of which was PGCon. As far as I understand at this point in time, your review identified exactly zero potential data loss bugs. (We thought there was one, but it looks like there isn't.) All of the non-critical defects you identified have now been fixed, apart from the lack of a better testing tool. And since there is ongoing discussion (call it bikeshedding if you want) about what would actually help in that area, I really don't feel like anything very awful is happening here. I really don't understand how you can not weigh in on the original thread leading up to my mid-March commits and say "hey, this needs a better testing tool", and then when you finally get around to reviewing it in May, I'm supposed to drop everything and write one immediately. Why do you get two months from the time of commit to weigh in but I get no time to respond? For my part, I thought I *had* written a testing tool - that's what pg_visibility is and that's what I used to test the feature before committing it. Now, you think that's not good enough, and I respect your opinion, but it's not as if you said this back when this was being committed. Or at least if you did, I don't remember it. >> and like you have the right to assign me arbitrary work, which I >> think you don't. > > It's not like adding a parameter for this would be a lot of work; > there's even a patch out there. I'm getting impatient because I feel > the issue of this critical feature not being testable is getting ignored > and/or played down. And then sidetracked into a general "let's add a > database consistency checker" type discussion. Which we need, but won't > get in 9.6. I know there's a patch. Both Tom and I are skeptical about whether it adds value, and I really don't think you've spelled out in as much detail why you think it will help as I have why I think it won't. Initially, I was like "ok, sure, we should have that", but the more I thought about it (another advantage of time passing: you can think about things more) the less convinced I was that it did anything useful. I don't think that's very unreasonable. The importance of the feature is exactly why we *should* think carefully about what is best here and not just do the first thing that pops into our head. > If you say: "I agree with the feature in principle, but I don't want to > spend time to review/commit it." - ok, that's fair enough. But at the > moment that isn't what I'm reading between the lines.
No, what I'm saying is "I'm not confident that this feature adds value, and I'm afraid that by adding it we are making ourselves feel better without solving any real problem". I'm also saying "let's try to agree on what problems we need to solve first and then decide on the solutions". >> If you want to have a >> reasonable conversation about what the options are for making this >> better, great. > > Yes, I want that. Great. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Mon, Jun 6, 2016 at 2:35 PM, Andres Freund <andres@anarazel.de> wrote: >> > Why would they have to write the complex query? Wouldn't they just >> > need to run the one we wrote for them? > > Then write that query. Verify that that query performs halfway > reasonably fast. Document that it should be run against databases after > subjecting them to tests. That'd address my concern as well. Here is a first attempt at such a query. It requires that the pageinspect and pg_visibility extensions be installed.

    SELECT c.oid, v.blkno, array_agg(hpi.lp) AS affect_lps
    FROM pg_class c,
         LATERAL ROWS FROM (pg_visibility(c.oid)) v,
         LATERAL ROWS FROM (heap_page_items(get_raw_page(c.oid::regclass::text,
                                                         blkno::int4))) hpi
    WHERE c.relkind IN ('r', 't', 'm')
      AND v.all_frozen
      AND (((hpi.t_infomask & 768) != 768 AND hpi.t_xmin NOT IN (1, 2))
           OR (hpi.t_infomask & 2048) != 2048)
    GROUP BY 1, 2
    ORDER BY 1, 2;

I am not sure this is 100% correct, especially the XMAX-checking part: is HEAP_XMAX_INVALID guaranteed to be set on a fully-frozen tuple? Is the method of constructing the first argument to get_raw_page() going to be robust in all cases? I'm not sure what the performance will be on a large table, either. That will have to be checked. And I obviously have not done extensive stress runs yet. But maybe it's a start. Comments? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Mon, Jun 6, 2016 at 11:35 AM, Andres Freund <andres@anarazel.de> wrote: >> We need a read-only utility which checks that the system is in a correct >> and valid state. There are a few of those which have been built for >> different pieces, I believe, and we really should have one for the >> visibility map, but I don't think it makes sense to imply in any way >> that VACUUM can or should be used for that. > > Meh. This is vacuum behaviour that *has existed* up to this point. You > essentially removed it. Sure, I'm all for adding a verification > tool. But that's just pie in the sky at this point. We have a complex, > data-loss-threatening feature, which just about nobody can verify at > this point. That's crazy. FWIW, I agree with the general sentiment. Building a stress-testing suite would have been a good idea. In general, testability is a design goal that I'd be willing to give up other things for. -- Peter Geoghegan
On Mon, Jun 6, 2016 at 4:27 PM, Andres Freund <andres@anarazel.de> wrote: > On 2016-06-06 16:18:19 -0400, Stephen Frost wrote: >> That could be as simple as a query with the right things installed, or >> it might be an independent tool, but not having any way to check isn't >> good. That said, trying to make VACUUM do that doesn't make sense to me >> either. > > The point is that VACUUM *has* these types of checks, and has had them for > many years:
>
>     else if (all_visible_according_to_vm && !PageIsAllVisible(page) &&
>              VM_ALL_VISIBLE(onerel, blkno, &vmbuffer))
>     {
>         elog(WARNING, "page is not marked all-visible but visibility map bit is set in relation \"%s\" page %u",
>              relname, blkno);
>         visibilitymap_clear(onerel, blkno, vmbuffer);
>     }
>     ...
>     else if (PageIsAllVisible(page) && has_dead_tuples)
>     {
>         elog(WARNING, "page containing dead tuples is marked as all-visible in relation \"%s\" page %u",
>              relname, blkno);
>         PageClearAllVisible(page);
>         MarkBufferDirty(buf);
>         visibilitymap_clear(onerel, blkno, vmbuffer);
>     }
>
> The point is that, after the introduction of the freeze bit, there's no > way to reach them anymore (and they're missing a useful extension of > these warnings, but ...); these warnings have caught bugs. I wouldn't > advocate for the vacuum option otherwise. So a couple of things:

1. I think it is pretty misleading to say that those checks aren't reachable any more. It's not like we freeze every page when we mark it all-visible. In most cases, I think that what will happen is that the page will be marked all-visible and then, because it is all-visible, skipped by subsequent vacuums, so that it doesn't get marked all-frozen until a few hundred million transactions later. Of course there will be some cases when a page gets marked all-visible and all-frozen at the same time, but I don't see why we should expect that to be the norm.

2. With the new pg_visibility extension, you can actually check the same thing the first warning checks, like this:

    select * from pg_visibility('t1'::regclass) where all_visible and not pd_all_visible;

IMHO, that's a substantial improvement over running VACUUM and checking whether it spits out a WARNING. The second one, you can't currently trigger for all-frozen pages. The query I just sent in my other email could perhaps be adapted to that purpose, but maybe this is a good-enough reason to add VACUUM (even_frozen_pages).

3. If you think there are analogous checks that I should add for the frozen case, or that you want to add yourself, please say what they are specifically. I *did* think about it when I wrote that code and I didn't see how to make it work. If I had, I would have added them. The whole point of review here is, hopefully, to illuminate what should have been done differently - if I'd known how to do it better, I would have done so. Provide an idea, or better yet, provide a patch. If you see how to do it, coding it up shouldn't be the hard part. Thanks, -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 2016-06-06 16:41:19 -0400, Robert Haas wrote: > I really don't understand how you can not weigh in on the original > thread leading up to my mid-March commits and say "hey, this needs a > better testing tool", and then when you finally get around to > reviewing it in May, I'm supposed to drop everything and write one > immediately. Meh. Asking you to "drop everything" and starting to push a month later are very different things. The reason I'm pushing is because this atm seems likely to slip enough that we'll decide "can't do this for 9.6". And I think that'd be seriously bad. > Why do you get two months from the time of commit to weigh in but I > get no time to respond? Really? You've started to apply pressure to fix things days after they've been discovered. It's been a month. > For my part, I thought I *had* > written a testing tool - that's what pg_visibility is and that's what > I used to test the feature before committing it. I think looking only at page-level data, and not at row-level data, is insufficient. And I think we need to make $tool output the data in a way that only returns data if things are wrong (that can be a pre-canned query). > Now, you think that's not good enough, and I respect your opinion, but > it's not as if you said this back when this was being committed. Or > at least if you did, I don't remember it. I think I mentioned testing ages ago, but not around the commit, no. I kind of had assumed that it was there. I don't think that's really relevant though. Backend flushing was discussed and benchmarked over months as well; and while I don't agree with your conclusion, it's absolutely sane of you to push for changing the default on that, even if you didn't immediately push back. > I know there's a patch. Both Tom and I are skeptical about whether it > adds value, and I really don't think you've spelled out in as much > detail why you think it will help as I have why I think it won't. The primary reason I think it'll help is that it allows users/testers to run a simple one-line command (VACUUM (scan_all);) in their database, and they'll get a clear "WARNING: XXX is bad" message if something's broken, and nothing if things are ok. Vacuum isn't a bad place for that, because it'll be the place that removes dead item pointers and such if things were wrongly labeled; and because we historically have emitted warnings from there. The more complex stuff we ask testers to run, the less likely it is that they'll actually do that. I'd also be ok with adding & documenting (beta release notes)

    CREATE EXTENSION pg_visibility;
    SELECT relname FROM pg_class WHERE relkind IN ('r', 'm') AND NOT pg_check_visibility(oid);

or something along those lines. Greetings, Andres Freund
On Mon, Jun 6, 2016 at 5:06 PM, Andres Freund <andres@anarazel.de> wrote: > On 2016-06-06 16:41:19 -0400, Robert Haas wrote: >> I really don't understand how you can not weigh in on the original >> thread leading up to my mid-March commits and say "hey, this needs a >> better testing tool", and then when you finally get around to >> reviewing it in May, I'm supposed to drop everything and write one >> immediately. > > Meh. Asking you to "drop everything" and starting to push a month later > are very different things. The reason I'm pushing is because this atm > seems likely to slip enough that we'll decide "can't do this for > 9.6". And I think that'd be seriously bad. To be clear, I'm not objecting to you pushing on this. I just think your tone sounds a bit, uh, antagonized. >> Why do you get two months from the time of commit to weigh in but I >> get no time to respond? > > Really? You've started to apply pressure to fix things days after > they've been discovered. It's been a month. Yes, it would have been nice if I had gotten to this one sooner. But it's not like you said "hey, hurry up" before I started working on it. You waited until I did start working on it and *then* complained that I didn't get to it sooner. I cannot rewind time. >> For my part, I thought I *had* >> written a testing tool - that's what pg_visibility is and that's what >> I used to test the feature before committing it. > > I think looking only at page-level data, and not at row-level data, is > insufficient. And I think we need to make $tool output the data in a way > that only returns data if things are wrong (that can be a pre-canned > query). OK. I didn't think that was necessary, but it sure can't hurt. >> I know there's a patch. Both Tom and I are skeptical about whether it >> adds value, and I really don't think you've spelled out in as much >> detail why you think it will help as I have why I think it won't. > > The primary reason I think it'll help is that it allows users/testers to > run a simple one-line command (VACUUM (scan_all);) in their database, and > they'll get a clear "WARNING: XXX is bad" message if something's broken, > and nothing if things are ok. Vacuum isn't a bad place for that, > because it'll be the place that removes dead item pointers and such if > things were wrongly labeled; and because we historically have emitted > warnings from there. The more complex stuff we ask testers to run, the > less likely it is that they'll actually do that. OK, now I understand. Let's see if there is general agreement on this and then we can decide how to proceed. I think the main danger here is that people will think that this option is more useful than it really is and start using it in all kinds of cases where it isn't really necessary in the hopes that it will fix problems it really can't fix. I think we need to write the documentation in such a way as to be deeply discouraging to people who might otherwise be prone to unwarranted optimism. Otherwise, 5 years from now, we're going to be fielding complaints from people who are unhappy that there's no way to make autovacuum run with (even_frozen_pages true). > I'd also be ok with adding & documenting (beta release notes) > CREATE EXTENSION pg_visibility; > SELECT relname FROM pg_class WHERE relkind IN ('r', 'm') AND NOT pg_check_visibility(oid); > or something along those lines. That wouldn't be too useful as-written in my book, because it gives you no detail on what exactly the problem was.
Maybe it could be "pg_check_visibility(regclass) RETURNS SETOF tid", where the returned TIDs are non-frozen TIDs on frozen pages. Then I think something like this would work:

    SELECT c.oid, pg_check_visibility(c.oid)
    FROM pg_class c
    WHERE relkind IN ('r', 't', 'm');

If you get any rows back, you've got trouble. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Hi, On 2016-06-06 17:00:19 -0400, Robert Haas wrote: > 1. I think it is pretty misleading to say that those checks aren't > reachable any more. It's not like we freeze every page when we mark > it all-visible. True. What I mean is that you can't force the checks (and some that I think should be added) to occur anymore. Once a page is frozen it'll be kinda hard to predict whether vacuum touches it (due to the skip logic). > 2. With the new pg_visibility extension, you can actually check the > same thing the first warning checks, like this: > > select * from pg_visibility('t1'::regclass) where all_visible and not > pd_all_visible; Right, but not the second. > IMHO, that's a substantial improvement over running VACUUM and > checking whether it spits out a WARNING. I think it's a mixed bag. I do think that WARNINGs are a lot easier to understand for a casual user/tester than having to write/copy queries which return results where you don't know what the expected result is. I agree that it's better to have that in a non-modifying way - although I'm afraid atm it's not really possible to do a HeapTupleSatisfies* without modifications :(. > 3. If you think there are analogous checks that I should add for the > frozen case, or that you want to add yourself, please say what they > are specifically. I *did* think about it when I wrote that code and I > didn't see how to make it work. If I had, I would have added them. > The whole point of review here is, hopefully, to illuminate what > should have been done differently - if I'd known how to do it better, > I would have done so. Provide an idea, or better yet, provide a > patch. If you see how to do it, coding it up shouldn't be the hard > part. I think it's pretty important (and not hard) to add a check for (all_frozen_according_to_vm && has_unfrozen_tuples). Checking for VM_ALL_FROZEN && !VM_ALL_VISIBLE looks worthwhile as well, especially as we could check that always, without measurable overhead. But the former primarily makes sense if we have a way to force the check to occur in a way that's not dependent on the state of neighbouring pages. Greetings, Andres Freund
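Written out as a standalone sketch over plain booleans (the real checks would live in vacuumlazy.c and use its local state; the variable names are assumptions), the two suggested tests might look like this:

    #include <stdbool.h>
    #include <stdio.h>

    /*
     * Sketch of the suggested sanity checks: an all-frozen page must
     * also be all-visible, and must not contain any unfrozen tuples.
     */
    static void
    check_frozen_consistency(unsigned blkno,
                             bool vm_all_visible,
                             bool vm_all_frozen,
                             bool has_unfrozen_tuples)
    {
        if (vm_all_frozen && !vm_all_visible)
            fprintf(stderr,
                    "WARNING: page %u is marked all-frozen but not all-visible\n",
                    blkno);

        if (vm_all_frozen && has_unfrozen_tuples)
            fprintf(stderr,
                    "WARNING: page %u contains unfrozen tuples but is marked all-frozen\n",
                    blkno);
    }

The first test only compares two VM bits, which is why it could run unconditionally at negligible cost; the second requires actually examining the page's tuples, which is why it only pays off if the page can be forced to be visited.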
Hi, On 2016-06-06 17:22:38 -0400, Robert Haas wrote: > > I'd also be ok with adding & documenting (beta release notes) > > CREATE EXTENSION pg_visibility; > > SELECT relname FROM pg_class WHERE relkind IN ('r', 'm') AND NOT pg_check_visibility(oid); > > or something along those lines. > > That wouldn't be too useful as-written in my book, because it gives > you no detail on what exactly the problem was. True. I don't think that's a big issue though, because we'd likely want a lot more detail after a report anyway, to analyze things properly. > Maybe it could be > "pg_check_visibility(regclass) RETURNS SETOF tid", where the returned > TIDs are non-frozen TIDs on frozen pages. Then I think something like > this would work: > > SELECT c.oid, pg_check_visibility(c.oid) FROM pg_class c WHERE relkind > IN ('r', 't', 'm'); > > If you get any rows back, you've got trouble. That'd work too, with the slight danger of returning way too much data. - Andres
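For concreteness, the heart of such a pg_check_visibility()-style scan might look roughly like the sketch below (assumed 9.6 heapam/visibilitymap APIs and a caller-supplied report callback; the SRF plumbing, snapshot handling, and any checks beyond heap_tuple_needs_eventual_freeze() are elided, and this is not the patch that follows):

    #include "postgres.h"

    #include "access/heapam.h"
    #include "access/visibilitymap.h"
    #include "storage/bufmgr.h"

    /*
     * If the VM claims blkno is all-frozen, report the offset of every
     * tuple on it that would still need freezing eventually.
     */
    static void
    check_frozen_block(Relation rel, BlockNumber blkno, Buffer *vmbuffer,
                       void (*report) (BlockNumber, OffsetNumber))
    {
        Buffer      buf;
        Page        page;
        OffsetNumber offnum,
                    maxoff;

        if (!VM_ALL_FROZEN(rel, blkno, vmbuffer))
            return;             /* only all-frozen pages are checked */

        buf = ReadBuffer(rel, blkno);
        LockBuffer(buf, BUFFER_LOCK_SHARE);
        page = BufferGetPage(buf);
        maxoff = PageGetMaxOffsetNumber(page);

        for (offnum = FirstOffsetNumber; offnum <= maxoff;
             offnum = OffsetNumberNext(offnum))
        {
            ItemId      itemid = PageGetItemId(page, offnum);

            if (!ItemIdIsNormal(itemid))
                continue;
            if (heap_tuple_needs_eventual_freeze(
                    (HeapTupleHeader) PageGetItem(page, itemid)))
                report(blkno, offnum);  /* unfrozen tuple on a "frozen" page */
        }

        UnlockReleaseBuffer(buf);
    }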
On Tue, Jun 7, 2016 at 2:52 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Mon, Jun 6, 2016 at 5:06 PM, Andres Freund <andres@anarazel.de> wrote:
>
>
> > I'd also be ok with adding & documenting (beta release notes)
> > CREATE EXTENSION pg_visibility;
> > SELECT relname FROM pg_class WHERE relkind IN ('r', 'm') AND NOT pg_check_visibility(oid);
> > or something along those lines.
>
> That wouldn't be too useful as-written in my book, because it gives
> you no detail on what exactly the problem was. Maybe it could be
> "pg_check_visibility(regclass) RETURNS SETOF tid", where the returned
> TIDs are non-frozen TIDs on frozen pages. Then I think something like
> this would work:
>
> SELECT c.oid, pg_check_visibility(c.oid) FROM pg_class c WHERE relkind
> IN ('r', 't', 'm');
>
I have implemented the above function in the attached patch. Currently it returns a SETOF tuple IDs, but if we want some variant of that, it should also be possible.
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
Attachment
On Tue, Jun 7, 2016 at 11:19 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Tue, Jun 7, 2016 at 2:52 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>>
>> On Mon, Jun 6, 2016 at 5:06 PM, Andres Freund <andres@anarazel.de> wrote:
>>
>> > I'd also be ok with adding & documenting (beta release notes)
>> > CREATE EXTENSION pg_visibility;
>> > SELECT relname FROM pg_class WHERE relkind IN ('r', 'm') AND NOT
>> > pg_check_visibility(oid);
>> > or something along those lines.
>>
>> That wouldn't be too useful as-written in my book, because it gives
>> you no detail on what exactly the problem was. Maybe it could be
>> "pg_check_visibility(regclass) RETURNS SETOF tid", where the returned
>> TIDs are non-frozen TIDs on frozen pages. Then I think something like
>> this would work:
>>
>> SELECT c.oid, pg_check_visibility(c.oid) FROM pg_class WHERE relkind
>> IN ('r', 't', 'm');
>
> I have implemented the above function in the attached patch. Currently, it
> returns SETOF tupleids, but if we want some variant of the same, that
> should also be possible.
>
> With Regards,
> Amit Kapila.
> EnterpriseDB: http://www.enterprisedb.com

Thank you for implementing the patch.

I've not tested it deeply, but here are some comments.

This check tool only checks whether an all-frozen page has a live
unfrozen tuple. That is, it doesn't cover the case where an all-frozen
page mistakenly has a dead frozen tuple. I think this tool should check
that case too; otherwise the function name would need to be changed.

+ /* Clean up. */
+ if (vmbuffer != InvalidBuffer)
+ ReleaseBuffer(vmbuffer);

I think that we should use BufferIsValid() here.

Regards,

--
Masahiko Sawada
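With Sawada's suggestion applied, the quoted cleanup would read as below; BufferIsValid() is the stock bufmgr.h test, and the behavior is unchanged since InvalidBuffer is the only invalid value vmbuffer can hold here:

	/* Clean up. */
	if (BufferIsValid(vmbuffer))
		ReleaseBuffer(vmbuffer);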
On 2016-06-07 19:49:59 +0530, Amit Kapila wrote: > On Tue, Jun 7, 2016 at 2:52 AM, Robert Haas <robertmhaas@gmail.com> wrote: > > > > On Mon, Jun 6, 2016 at 5:06 PM, Andres Freund <andres@anarazel.de> wrote: > > > > > > > I'd also be ok with adding & documenting (beta release notes) > > > CREATE EXTENSION pg_visibility; > > > SELECT relname FROM pg_class WHERE relkind IN ('r', 'm') AND NOT > pg_check_visibility(oid); > > > or something olong those lines. > > > > That wouldn't be too useful as-written in my book, because it gives > > you no detail on what exactly the problem was. Maybe it could be > > "pg_check_visibility(regclass) RETURNS SETOF tid", where the returned > > TIDs are non-frozen TIDs on frozen pages. Then I think something like > > this would work: > > > > SELECT c.oid, pg_check_visibility(c.oid) FROM pg_class WHERE relkind > > IN ('r', 't', 'm'); > > > > I have implemented the above function in attached patch. Currently, it > returns SETOF tupleids, but if we want some variant of same, that should > also be possible. Cool! I think if we go with the pg_check_visibility approach, we should also copy the other consistency checks from vacuumlazy.c, given they can't easily be triggered. Wonder how we can report both block and tuple level issues. Kinda inclined to report everything as a block level issue? Regards, Andres
On 6/6/16 3:57 PM, Peter Geoghegan wrote: > On Mon, Jun 6, 2016 at 11:35 AM, Andres Freund <andres@anarazel.de> wrote: >>> We need a read-only utility which checks that the system is in a correct >>> and valid state. There are a few of those which have been built for >>> different pieces, I believe, and we really should have one for the >>> visibility map, but I don't think it makes sense to imply in any way >>> that VACUUM can or should be used for that. >> >> Meh. This is vacuum behaviour that *has existed* up to this point. You >> essentially removed it. Sure, I'm all for adding a verification >> tool. But that's just pie in the skie at this point. We have a complex, >> data loss threatening feature, which just about nobody can verify at >> this point. That's crazy. > > FWIW, I agree with the general sentiment. Building a stress-testing > suite would have been a good idea. In general, testability is a design > goal that I'd be willing to give up other things for. Related to that, I suspect it would be helpful if it was possible to test boundary cases in this kind of critical code by separating the logic from the underlying implementation. It becomes very hard to verify the system does the right thing in some of these scenarios, because it's so difficult to put the system into that state to begin with. Stuff that depends on burning through a large number of XIDs is an example of that. (To be clear, I'm talking about unit-test kind of stuff here, not validating an existing system.) -- Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX Experts in Analytics, Data Architecture and PostgreSQL Data in Trouble? Get it in Treble! http://BlueTreble.com 855-TREBLE2 (855-873-2532) mobile: 512-569-9461
On Tue, Jun 7, 2016 at 10:19 AM, Amit Kapila <amit.kapila16@gmail.com> wrote: > I have implemented the above function in attached patch. Currently, it > returns SETOF tupleids, but if we want some variant of same, that should > also be possible. I think we'd want to bump the pg_visibility version to 1.1 and do the upgrade dance, since the existing thing was in beta1. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Tue, Jun 7, 2016 at 10:10 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> Thank you for implementing the patch.
>
> I've not test it deeply but here are some comments.
> This check tool only checks if the frozen page has live-unfrozen tuple.
> That is, it doesn't care in case where the all-frozen page mistakenly
> has dead-frozen tuple.
Do you mean to say that we should have a check for ItemIdIsDead(), and if an item is found to be dead, then add it to the array of non-frozen items? If so: earlier I thought we might not need this check, since we are already using heap_tuple_needs_eventual_freeze(), but looking at it again, it seems wise to check for dead items separately, as those won't be covered by the other check.
>
> + /* Clean up. */
> + if (vmbuffer != InvalidBuffer)
> + ReleaseBuffer(vmbuffer);
>
> I think that we should use BufferIsValid() here.
>
We can use BufferIsValid() as well, but I am trying to be consistent with nearby code; see collect_visibility_data(). We can change it in all places together if people prefer it that way.
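The separate dead-item check being discussed might look something like this; a sketch only, assuming it runs inside the patch's existing per-page loop, with page, blkno, items, and record_corrupt_item() taken from that context:

	OffsetNumber offnum;

	/*
	 * Sketch: on a page the VM claims is all-frozen, no line pointer should
	 * be dead.  heap_tuple_needs_eventual_freeze() never sees dead items,
	 * so they need this separate pass.
	 */
	for (offnum = FirstOffsetNumber;
		 offnum <= PageGetMaxOffsetNumber(page);
		 offnum = OffsetNumberNext(offnum))
	{
		ItemId		itemid = PageGetItemId(page, offnum);

		if (ItemIdIsDead(itemid))
		{
			ItemPointerData tid;

			ItemPointerSet(&tid, blkno, offnum);
			record_corrupt_item(items, &tid);
		}
	}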
On Wed, Jun 8, 2016 at 8:37 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Tue, Jun 7, 2016 at 10:19 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> > I have implemented the above function in attached patch. Currently, it
> > returns SETOF tupleids, but if we want some variant of same, that should
> > also be possible.
>
> I think we'd want to bump the pg_visibility version to 1.1 and do the
> upgrade dance, since the existing thing was in beta1.
>
Okay, will do it in next version of patch.
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On Tue, Jun 7, 2016 at 11:01 PM, Andres Freund <andres@anarazel.de> wrote:
> I think if we go with the pg_check_visibility approach, we should also
> copy the other consistency checks from vacuumlazy.c, given they can't
> easily be triggered.
Are you referring to the checks that are done in lazy_scan_heap() for each block? I think the meaningful checks in this context could be: (a) the page is marked all-visible, but the corresponding vm bit is not set; (b) the page is marked all-visible but has dead tuples; (c) the vm bit indicates frozen, but the page contains non-frozen tuples.
I think right now the design of pg_visibility is such that it returns the required information at the page level to the user by means of various functions like pg_visibility, pg_visibility_map, etc. If we want to add page-level checks in this new routine as well, then we have to think about what the output should be when such checks fail: shall we issue a warning, or return the information in some other way? Also, I think there will be some duplication with the information already provided via other functions of this module.
>
> Wonder how we can report both block and tuple
> level issues. Kinda inclined to report everything as a block level
> issue?
>
Given the way this module currently provides information, it seems better to have separate APIs for block-level and tuple-level inconsistencies. For the block level, I think most of the information can be retrieved by existing APIs, and for the tuple level, this new API can be used.
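In code terms, the three block-level checks listed above could be sketched as follows; report_block() is a hypothetical reporting helper, and has_dead_tuples/all_frozen stand for the results of a per-tuple scan like the one in lazy_scan_heap:

	/* (a) page header says all-visible, but the VM bit is not set */
	if (PageIsAllVisible(page) && !VM_ALL_VISIBLE(rel, blkno, &vmbuffer))
		report_block(blkno, "PD_ALL_VISIBLE set but VM all-visible bit clear");

	/* (b) VM says all-visible, yet the page contains dead tuples */
	if (VM_ALL_VISIBLE(rel, blkno, &vmbuffer) && has_dead_tuples)
		report_block(blkno, "all-visible page contains dead tuples");

	/* (c) VM says all-frozen, but some tuple still needs freezing */
	if (VM_ALL_FROZEN(rel, blkno, &vmbuffer) && !all_frozen)
		report_block(blkno, "all-frozen page contains non-frozen tuples");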
On 2016-06-08 10:04:56 +0530, Amit Kapila wrote: > On Tue, Jun 7, 2016 at 11:01 PM, Andres Freund <andres@anarazel.de> wrote:> > > I think if we go with the pg_check_visibility approach, we should also > > copy the other consistency checks from vacuumlazy.c, given they can't > > easily be triggered. > > Are you referring to checks that are done in lazy_scan_heap() for each > block? Yes. > I think the meaning full checks in this context could be (a) page > is marked as visible, but corresponding vm is not marked. (b) page is > marked as all visible and has dead tuples. (c) vm bit indicates frozen, but > page contains non-frozen tuples. Yes. > I think right now the design of pg_visibility is such that it returns the > required information at page level to user by means of various functions > like pg_visibility, pg_visibility_map, etc. If we want to add page level > checks in this new routine as well, then we have to think what should be > the output if such checks fails, shall we issue warning, shall we return > information in some other way. Right. > Also, I think there will be some duplicity > with the already provided information via other functions of this module. Don't think that's a problem. One part of the functionality then is returning the available information, the other is checking for problems and only returning problematic blocks. > > Wonder how we can report both block and tuple > > level issues. Kinda inclined to report everything as a block level > > issue? > > > > The way currently this module provides information, it seems better to have > separate API's for block and tuple level inconsistency. For block level, I > think most of the information can be retrieved by existing API's and for > tuple level, this new API can be used. I personally think simplicity is more important than detail here; but it's not that important. If this reports a problem, you can look into the nitty gritty using existing functions. Andres
On Wed, Jun 8, 2016 at 12:19 PM, Amit Kapila <amit.kapila16@gmail.com> wrote: > On Tue, Jun 7, 2016 at 10:10 PM, Masahiko Sawada <sawada.mshk@gmail.com> > wrote: >> >> Thank you for implementing the patch. >> >> I've not test it deeply but here are some comments. >> This check tool only checks if the frozen page has live-unfrozen tuple. >> That is, it doesn't care in case where the all-frozen page mistakenly >> has dead-frozen tuple. >> > > Do you mean to say that we should have a check for ItemIdIsDead() and then > if item is found to be dead, then add it to array of non_frozen items? Yes. > If so, earlier I thought we might not need this check as we are already using > heap_tuple_needs_eventual_freeze(), You're right. Sorry, I had misunderstood. > but now again looking at it, it seems > wise to check for dead items separately as those won't be covered by other > check. Sounds good. >> >> + /* Clean up. */ >> + if (vmbuffer != InvalidBuffer) >> + ReleaseBuffer(vmbuffer); >> >> I think that we should use BufferIsValid() here. >> > > We can use BufferIsValid() as well, but I am trying to be consistent with > nearby code, refer collect_visibility_data(). We can change at all places > together if people prefer that way. > In vacuumlazy.c we use it like BufferisValid(vmbuffer), so I think we can replace all these thing to be more safety if there is not specific reason. Regards, -- Masahiko Sawada
On Wed, Jun 8, 2016 at 11:39 AM, Andres Freund <andres@anarazel.de> wrote:
>
> On 2016-06-08 10:04:56 +0530, Amit Kapila wrote:
> > On Tue, Jun 7, 2016 at 11:01 PM, Andres Freund <andres@anarazel.de> wrote:>
> > > I think if we go with the pg_check_visibility approach, we should also
> > > copy the other consistency checks from vacuumlazy.c, given they can't
> > > easily be triggered.
> >
> > Are you referring to checks that are done in lazy_scan_heap() for each
> > block?
>
> Yes.
>
>
> > I think the meaning full checks in this context could be (a) page
> > is marked as visible, but corresponding vm is not marked. (b) page is
> > marked as all visible and has dead tuples. (c) vm bit indicates frozen, but
> > page contains non-frozen tuples.
>
> Yes.
>
If we want to address both page-level and tuple-level inconsistencies, I can see the following possibility.
1. An API that returns a set of records identifying a block that has an inconsistent vm bit, a block where an all-visible page contains dead tuples, and a block where the vm bit indicates frozen but the page contains non-frozen tuples. Three separate block numbers are required in the record to distinguish which problem applies to which block.
Signature of API will be something like:
pg_check_visibility_blocks(regclass, corrupt_vm_blkno OUT bigint, corrupt_dead_blkno OUT bigint, corrupt_frozen_blkno OUT boolean) RETURNS SETOF record
2. An API that provides information about non-frozen tuples on a frozen page
Signature of API:
CREATE FUNCTION pg_check_visibility_tuples(regclass, t_ctid OUT tid) RETURNS SETOF tid
This is the same as what is present in the current patch [1].
With this scheme, a user can use the first API to find corrupt blocks, if any, and the second API when further information is required.
Does that address your concern? If you, Robert, and others are okay with the above idea, then I will send an updated patch.
On Wed, Jun 8, 2016 at 4:01 AM, Amit Kapila <amit.kapila16@gmail.com> wrote: > If we want to address both page level and tuple level inconsistencies, I > could see below possibility. > > 1. An API that returns setof records containing a block that have > inconsistent vm bit, a block where visible page contains dead tuples and a > block where vm bit indicates frozen, but page contains non-frozen tuples. > Three separate block numbers are required in record to distinguish the > problem with block. > > Signature of API will be something like: > pg_check_visibility_blocks(regclass, corrupt_vm_blkno OUT bigint, > corrupt_dead_blkno OUT bigint, corrupt_frozen_blkno OUT boolean) RETURNS > SETOF record I don't understand this, and I think we're making this too complicated. The function that just returned non-frozen TIDs on supposedly-frozen pages was simple. Now we're trying to redesign this into a general-purpose integrity checker on the eve of beta2, and I think that's a bad idea. We don't have time to figure that out, get consensus on it, and do it well, and I don't want to be stuck supporting something half-baked from now until eternity. Let's scale back our goals here to something that can realistically be done well in the time available. Here's my proposal: 1. You already implemented a function to find non-frozen tuples on supposedly all-frozen pages. Great. 2. Let's implement a second function to find dead tuples on supposedly all-visible pages. 3. And then let's call it good. If we start getting into the game of "well, that's not enough because you can also check for X", that's an infinite treadmill. There will always be more things we can check. But that's the project of building an integrity checker, which while worthwhile, is out of scope for 9.6. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Wed, Jun 8, 2016 at 6:31 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Wed, Jun 8, 2016 at 4:01 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> > If we want to address both page level and tuple level inconsistencies, I
> > could see below possibility.
> >
> > 1. An API that returns setof records containing a block that have
> > inconsistent vm bit, a block where visible page contains dead tuples and a
> > block where vm bit indicates frozen, but page contains non-frozen tuples.
> > Three separate block numbers are required in record to distinguish the
> > problem with block.
> >
> > Signature of API will be something like:
> > pg_check_visibility_blocks(regclass, corrupt_vm_blkno OUT bigint,
> > corrupt_dead_blkno OUT bigint, corrupt_frozen_blkno OUT boolean) RETURNS
> > SETOF record
>
> I don't understand this,
This new API was to address Andres's concern about checking block-level inconsistencies, as we do in lazy_scan_heap. It returns a set of inconsistent blocks.
>
> The function that just returned non-frozen TIDs on
> supposedly-frozen pages was simple. Now we're trying to redesign this
> into a general-purpose integrity checker on the eve of beta2, and I
> think that's a bad idea. We don't have time to figure that out, get
> consensus on it, and do it well, and I don't want to be stuck
> supporting something half-baked from now until eternity. Let's scale
> back our goals here to something that can realistically be done well
> in the time available.
>
> Here's my proposal:
>
> 1. You already implemented a function to find non-frozen tuples on
> supposedly all-frozen pages. Great.
>
> 2. Let's implement a second function to find dead tuples on supposedly
> all-visible pages.
>
> 3. And then let's call it good.
>
Your proposal sounds good; I will send an updated patch if there are no further concerns.
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On Wed, Jun 8, 2016 at 6:31 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> Here's my proposal:
>
> 1. You already implemented a function to find non-frozen tuples on
> supposedly all-frozen pages. Great.
>
> 2. Let's implement a second function to find dead tuples on supposedly
> all-visible pages.
>
I am planning to name them pg_check_frozen and pg_check_visible; let me know if you think something else would suit better.
On Thu, Jun 9, 2016 at 8:48 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Wed, Jun 8, 2016 at 6:31 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> >
> >
> > Here's my proposal:
> >
> > 1. You already implemented a function to find non-frozen tuples on
> > supposedly all-frozen pages. Great.
> >
> > 2. Let's implement a second function to find dead tuples on supposedly
> > all-visible pages.
> >
>
> I am planning to name them pg_check_frozen and pg_check_visible; let me know if you think something else would suit better.
>
Attached patch implements the above 2 functions. I have addressed the comments from Sawada-san and you in the latest patch and updated the documentation as well.
On Thu, Jun 9, 2016 at 5:48 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> Attached patch implements the above 2 functions. I have addressed the
> comments from Sawada-san and you in the latest patch and updated the
> documentation as well.

I made a number of changes to this patch. Here is the new version.

1. The algorithm you were using for growing the array size is unsafe
and can easily overrun the array. Suppose that each of the first two
pages has some corrupt tuples, more than 50% of MaxHeapTuplesPerPage
but less than the full value of MaxHeapTuplesPerPage. Your code will
conclude that the array does not need to be enlarged after processing
the first page. I switched this to what I consider the normal coding
pattern for such problems.

2. The all-visible checks seemed to me to be incorrect and incomplete.
I made the check match the logic in lazy_scan_heap.

3. Your 1.0 -> 1.1 upgrade script was missing copies of the REVOKE
statements you added to the 1.1 script. I added them.

4. The tests as written were not safe under concurrency; they could
return spurious results if the page changed between the time you
checked the visibility map and the time you actually examined the
tuples. I think people will try running these functions on live
systems, so I changed the code to recheck the VM bits after locking
the page. Unfortunately, there's either still a concurrency-related
problem here or there's a bug in the all-frozen code itself, because I
once managed to get pg_check_frozen('pgbench_accounts') to return a
TID while pgbench was running concurrently. That's a bit alarming, but
since I can't reproduce it I don't really have a clue how to track
down the problem.

5. I made various cosmetic improvements.

If there are no objections, I will go ahead and commit this tomorrow,
because even if there is a bug (see point #4 above) I think it's
better to have this in the tree than not. However, code review and/or
testing with these new functions seems like it would be an extremely
good idea.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
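The "normal coding pattern" in point 1 is presumably the test-before-every-append idiom, which can never overrun the array no matter how many corrupt items one page yields; a sketch, with the tids/count/allocated field names assumed:

	/* Grow on demand before each append; doubling amortizes the repallocs. */
	if (items->count >= items->allocated)
	{
		items->allocated *= 2;
		items->tids = (ItemPointer) repalloc(items->tids,
							items->allocated * sizeof(ItemPointerData));
	}
	items->tids[items->count++] = tuple.t_self;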
Hi Robert, Amit, thanks for working on this. On 2016-06-09 12:11:15 -0400, Robert Haas wrote: > 4. The tests as written were not safe under concurrency; they could > return spurious results if the page changed between the time you > checked the visibility map and the time you actually examined the > tuples. I think people will try running these functions on live > systems, so I changed the code to recheck the VM bits after locking > the page. Unfortunately, there's either still a concurrency-related > problem here or there's a bug in the all-frozen code itself because I > once managed to get pg_check_frozen('pgbench_accounts') to return a > TID while pgbench was running concurrently. That's a bit alarming, > but since I can't reproduce it I don't really have a clue how to track > down the problem. Ugh, that's a bit concerning. > If there are not objections, I will go ahead and commit this tomorrow, > because even if there is a bug (see point #4 above) I think it's > better to have this in the tree than not. However, code review and/or > testing with these new functions seems like it would be an extremely > good idea. I'll try to spend some time on that today (code review & testing). Andres
Hi,

I found a few relatively minor issues.

1) I think we should perform a relkind check in collect_corrupt_items().
Atm we'll "gladly" run against an index. If we actually entered the main
portion of the loop in collect_corrupt_items(), that could end up
corrupting the table (via HeapTupleSatisfiesVacuum()). But it's probably
safe, because the vm fork doesn't exist for anything but heap/toast
relations.

2) GetOldestXmin() currently specifies a relation, which can cause
trouble in recovery:

	/*
	 * If we're not computing a relation specific limit, or if a shared
	 * relation has been passed in, backends in all databases have to be
	 * considered.
	 */
	allDbs = rel == NULL || rel->rd_rel->relisshared;

	/* Cannot look for individual databases during recovery */
	Assert(allDbs || !RecoveryInProgress());

i.e. we'll Assert out. I think that needs to be fixed.

3) Harmless here, but I think it's bad policy to release locks on normal
relations before the end of xact.

+	relation_close(rel, AccessShareLock);
+

4)
+	if (check_visible)
+	{
+		HTSV_Result state;
+
+		state = HeapTupleSatisfiesVacuum(&tuple, OldestXmin, buffer);
+		if (state != HEAPTUPLE_LIVE ||
+			!HeapTupleHeaderXminCommitted(tuple.t_data))
+			record_corrupt_item(items, &tuple.t_data->t_ctid);
+		else

This theoretically could give false positives, if GetOldestXmin() went
backwards. But I think that's ok.

5) There's a bunch of whitespace damage in the diff, like

	Oid relid = PG_GETARG_OID(0);
-	MemoryContext oldcontext;
+	MemoryContext oldcontext;

Otherwise this looks good. I played with it for a while, and besides
finding intentionally caused corruption, it didn't flag anything
(besides crashing on a standby, as in 2)).

Greetings,

Andres Freund
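For issue 1), the relkind check would presumably reject anything that cannot have a heap and a VM fork; a sketch, with the error wording invented:

	/* Sketch: only relation kinds with a visibility map should be accepted. */
	if (rel->rd_rel->relkind != RELKIND_RELATION &&
		rel->rd_rel->relkind != RELKIND_MATVIEW &&
		rel->rd_rel->relkind != RELKIND_TOASTVALUE)
		ereport(ERROR,
				(errcode(ERRCODE_WRONG_OBJECT_TYPE),
				 errmsg("\"%s\" is not a table, materialized view, or TOAST table",
						RelationGetRelationName(rel))));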
On 2016-06-09 19:33:52 -0700, Andres Freund wrote:
> I played with it for a while, and besides
> finding intentionally caused corruption, it didn't flag anything
> (besides crashing on a standby, as in 2)).

Ugh. Just seconds after I sent that email:

oid | t_ctid
------------------+--------------
pgbench_accounts | (889641,33)
pgbench_accounts | (893854,56)
pgbench_accounts | (924226,13)
pgbench_accounts | (1073457,51)
pgbench_accounts | (1084904,16)
pgbench_accounts | (1111996,26)
(6 rows)

oid | t_ctid
-----+--------
(0 rows)

oid | t_ctid
------------------+--------------
pgbench_accounts | (739198,13)
pgbench_accounts | (887254,11)
pgbench_accounts | (1050391,6)
pgbench_accounts | (1158640,46)
pgbench_accounts | (1238067,18)
pgbench_accounts | (1273282,22)
pgbench_accounts | (1355816,54)
pgbench_accounts | (1361880,33)
(8 rows)

Seems to be correlated with a concurrent vacuum, but it's hard to tell,
because I didn't have psql output a timestamp.

Greetings,

Andres Freund
On Fri, Jun 10, 2016 at 8:08 AM, Andres Freund <andres@anarazel.de> wrote:
On 2016-06-09 19:33:52 -0700, Andres Freund wrote:
> I played with it for a while, and besides
> finding intentionally caused corruption, it didn't flag anything
> (besides crashing on a standby, as in 2)).
Ugh. Just seconds after I sent that email:
oid | t_ctid
------------------+--------------
pgbench_accounts | (889641,33)
pgbench_accounts | (893854,56)
pgbench_accounts | (924226,13)
pgbench_accounts | (1073457,51)
pgbench_accounts | (1084904,16)
pgbench_accounts | (1111996,26)
(6 rows)
oid | t_ctid
-----+--------
(0 rows)
oid | t_ctid
------------------+--------------
pgbench_accounts | (739198,13)
pgbench_accounts | (887254,11)
pgbench_accounts | (1050391,6)
pgbench_accounts | (1158640,46)
pgbench_accounts | (1238067,18)
pgbench_accounts | (1273282,22)
pgbench_accounts | (1355816,54)
pgbench_accounts | (1361880,33)
(8 rows)
Is this the output of pg_check_visible() or pg_check_frozen()?
On June 9, 2016 7:46:06 PM PDT, Amit Kapila <amit.kapila16@gmail.com> wrote:
>On Fri, Jun 10, 2016 at 8:08 AM, Andres Freund <andres@anarazel.de>
>wrote:
>
>> On 2016-06-09 19:33:52 -0700, Andres Freund wrote:
>> > I played with it for a while, and besides
>> > finding intentionally caused corruption, it didn't flag anything
>> > (besides crashing on a standby, as in 2)).
>>
>> Ugh. Just seconds after I sent that email:
>>
>> oid | t_ctid
>> ------------------+--------------
>> pgbench_accounts | (889641,33)
>> pgbench_accounts | (893854,56)
>> pgbench_accounts | (924226,13)
>> pgbench_accounts | (1073457,51)
>> pgbench_accounts | (1084904,16)
>> pgbench_accounts | (1111996,26)
>> (6 rows)
>>
>> oid | t_ctid
>> -----+--------
>> (0 rows)
>>
>> oid | t_ctid
>> ------------------+--------------
>> pgbench_accounts | (739198,13)
>> pgbench_accounts | (887254,11)
>> pgbench_accounts | (1050391,6)
>> pgbench_accounts | (1158640,46)
>> pgbench_accounts | (1238067,18)
>> pgbench_accounts | (1273282,22)
>> pgbench_accounts | (1355816,54)
>> pgbench_accounts | (1361880,33)
>> (8 rows)
>>
>Is this the output of pg_check_visible() or pg_check_frozen()?

Unfortunately I don't know. I was running a union of both, I didn't
really expect to hit an issue... I guess I'll put a PANIC in the
relevant places and check whether I can reproduce it.

Andres
--
Sent from my Android device with K-9 Mail. Please excuse my brevity.
On Fri, Jun 10, 2016 at 8:27 AM, Andres Freund <andres@anarazel.de> wrote:
Unfortunately I don't know. I was running a union of both, I didn't really expect to hit an issue... I guess I'll put a PANIC in the relevant places and check whether I can reproduce it.
On June 9, 2016 7:46:06 PM PDT, Amit Kapila <amit.kapila16@gmail.com> wrote:
>On Fri, Jun 10, 2016 at 8:08 AM, Andres Freund <andres@anarazel.de>
>wrote:
>
>> On 2016-06-09 19:33:52 -0700, Andres Freund wrote:
>> > I played with it for a while, and besides
>> > finding intentionally caused corruption, it didn't flag anything
>> > (besides crashing on a standby, as in 2)).
>>
>> Ugh. Just seconds after I sent that email:
>>
>> oid | t_ctid
>> ------------------+--------------
>> pgbench_accounts | (889641,33)
>> pgbench_accounts | (893854,56)
>> pgbench_accounts | (924226,13)
>> pgbench_accounts | (1073457,51)
>> pgbench_accounts | (1084904,16)
>> pgbench_accounts | (1111996,26)
>> (6 rows)
>>
>> oid | t_ctid
>> -----+--------
>> (0 rows)
>>
>> oid | t_ctid
>> ------------------+--------------
>> pgbench_accounts | (739198,13)
>> pgbench_accounts | (887254,11)
>> pgbench_accounts | (1050391,6)
>> pgbench_accounts | (1158640,46)
>> pgbench_accounts | (1238067,18)
>> pgbench_accounts | (1273282,22)
>> pgbench_accounts | (1355816,54)
>> pgbench_accounts | (1361880,33)
>> (8 rows)
>>
>>
>Is this output of pg_check_visible() or pg_check_frozen()?
I have tried multiple ways of running pgbench with read-write tests, but could not see any such behaviour. I have even tried crashing and restarting the server and then running pgbench again. Do you see these records on the master or the standby?
While looking at the code in this area, I observed that during replay of records (heap_xlog_delete), we first clear the vm and then update the page. So we don't hold the buffer lock while updating the vm, whereas in the patch (collect_corrupt_items()) we are relying on the fact that clearing a vm bit requires acquiring the buffer lock. Can that cause a problem?
On 2016-06-10 11:58:26 +0530, Amit Kapila wrote: > I have tried in multiple ways by running pgbench with read-write tests, but > could not see any such behaviour. It took over an hour of pgbench on a fast laptop till I saw it. > I have tried by even crashing and > restarting the server and then again running pgbench. Do you see these > records on master or slave? Master, but with an existing standby. So it could be related to hot_standby_feedback or such. > While looking at code in this area, I observed that during replay of > records (heap_xlog_delete), we first clear the vm, then update the page. > So we don't have Buffer lock while updating the vm where as in the patch > (collect_corrupt_items()), we are relying on the fact that for clearing vm > bit one needs to acquire buffer lock. Can that cause a problem? Unsetting a vm bit is always safe, right? The invariant is that the VM may never falsely say all_visible/frozen, but it's perfectly ok for a page to be all_visible/frozen, without the VM bit being present. Andres
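Stated in code, the invariant is one-directional: only a set bit carries a promise. A sketch, where page_all_visible() and record_corrupt_block() are hypothetical helpers (the former standing for an actual scan of the page's tuples):

	/*
	 * A set VM bit must be backed by the heap page; a clear bit promises
	 * nothing, so an all-visible page with a clear VM bit is not corrupt.
	 */
	if (VM_ALL_VISIBLE(rel, blkno, &vmbuffer) && !page_all_visible(rel, blkno))
		record_corrupt_block(blkno);	/* genuine corruption */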
On Thu, Jun 9, 2016 at 9:41 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>
> 2. The all-visible checks seemed to me to be incorrect and incomplete.
> I made the check match the logic in lazy_scan_heap.
>
Okay, I thought we just wanted to check for dead tuples. If we want logic similar to lazy_scan_heap(), then I think we should also consider applying the snapshot-too-old threshold limit to OldestXmin. We currently do that in vacuum_set_xid_limits() for VACUUM. Is there a reason not to consider it for the visibility check function?
On Fri, Jun 10, 2016 at 1:11 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Thu, Jun 9, 2016 at 5:48 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> Attached patch implements the above 2 functions. I have addressed the
>> comments from Sawada-san and you in the latest patch and updated the
>> documentation as well.
>
> I made a number of changes to this patch. Here is the new version.
>
> 1. The algorithm you were using for growing the array size is unsafe
> and can easily overrun the array. Suppose that each of the first two
> pages has some corrupt tuples, more than 50% of MaxHeapTuplesPerPage
> but less than the full value of MaxHeapTuplesPerPage. Your code will
> conclude that the array does not need to be enlarged after processing
> the first page. I switched this to what I consider the normal coding
> pattern for such problems.
>
> 2. The all-visible checks seemed to me to be incorrect and incomplete.
> I made the check match the logic in lazy_scan_heap.
>
> 3. Your 1.0 -> 1.1 upgrade script was missing copies of the REVOKE
> statements you added to the 1.1 script. I added them.
>
> 4. The tests as written were not safe under concurrency; they could
> return spurious results if the page changed between the time you
> checked the visibility map and the time you actually examined the
> tuples. I think people will try running these functions on live
> systems, so I changed the code to recheck the VM bits after locking
> the page. Unfortunately, there's either still a concurrency-related
> problem here or there's a bug in the all-frozen code itself, because I
> once managed to get pg_check_frozen('pgbench_accounts') to return a
> TID while pgbench was running concurrently. That's a bit alarming, but
> since I can't reproduce it I don't really have a clue how to track
> down the problem.
>
> 5. I made various cosmetic improvements.
>
> If there are no objections, I will go ahead and commit this tomorrow,
> because even if there is a bug (see point #4 above) I think it's
> better to have this in the tree than not. However, code review and/or
> testing with these new functions seems like it would be an extremely
> good idea.

Thank you for working on this. Here are some minor comments.

---
+/*
+ * Return the TIDs of not-all-visible tuples in pages marked all-visible

If there is even one non-visible tuple in pages marked all-visible, the
database might be corrupted. Is it better "not-visible" or "non-visible"
instead of "not-all-visible"?

---
Do we need to check the page header flag? I think the database might
also be corrupt in the case where there is a non-visible tuple on a page
with PD_ALL_VISIBLE set. We could emit a WARNING in such a case.

Also, using the attached tool, which allows us to set spurious
visibility map status without actually modifying the tuple, I manually
created some situations where the database is corrupted and tested it;
ISTM that it works fine. I'm not proposing the tool as a new feature, of
course, but please use it as appropriate.

Regards,

--
Masahiko Sawada
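Sawada's page-header point could be covered by a check along these lines; a sketch only, where all_tuples_visible stands for the outcome of the per-tuple scan the patch already performs:

	/*
	 * PD_ALL_VISIBLE makes the same promise as the VM's all-visible bit,
	 * so a page with the flag set but a non-visible tuple is equally
	 * corrupt, even if the VM bit happens to be clear.
	 */
	if (PageIsAllVisible(page) && !all_tuples_visible)
		elog(WARNING, "page %u of relation \"%s\" has PD_ALL_VISIBLE set but contains non-visible tuples",
			 blkno, RelationGetRelationName(rel));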
On Fri, Jun 10, 2016 at 12:09 PM, Andres Freund <andres@anarazel.de> wrote:
>
> On 2016-06-10 11:58:26 +0530, Amit Kapila wrote:
>
>
> > While looking at code in this area, I observed that during replay of
> > records (heap_xlog_delete), we first clear the vm, then update the page.
> > So we don't have Buffer lock while updating the vm where as in the patch
> > (collect_corrupt_items()), we are relying on the fact that for clearing vm
> > bit one needs to acquire buffer lock. Can that cause a problem?
>
> Unsetting a vm bit is always safe, right?
I think so, which means this should not be a problem area.
On 2016-06-09 23:39:24 -0700, Andres Freund wrote:
> On 2016-06-10 11:58:26 +0530, Amit Kapila wrote:
> > I have tried in multiple ways by running pgbench with read-write tests, but
> > could not see any such behaviour.
>
> It took over an hour of pgbench on a fast laptop till I saw it.
>
> > I have tried by even crashing and
> > restarting the server and then again running pgbench. Do you see these
> > records on master or slave?
>
> Master, but with an existing standby. So it could be related to
> hot_standby_feedback or such.

I just managed to trigger it again.

#1  0x00007fa1a73778da in __GI_abort () at abort.c:89
#2  0x00007f9f1395e59c in record_corrupt_item (items=items@entry=0x2137be0, tid=0x7f9fb8681c0c)
    at /home/andres/src/postgresql/contrib/pg_visibility/pg_visibility.c:612
#3  0x00007f9f1395ead5 in collect_corrupt_items (relid=relid@entry=29449, all_visible=all_visible@entry=0 '\000', all_frozen=all_frozen@entry=1 '\001')
    at /home/andres/src/postgresql/contrib/pg_visibility/pg_visibility.c:572
#4  0x00007f9f1395f476 in pg_check_frozen (fcinfo=0x7ffe5343a200) at /home/andres/src/postgresql/contrib/pg_visibility/pg_visibility.c:292
#5  0x00000000005fdbec in ExecMakeTableFunctionResult (funcexpr=0x2168630, econtext=0x2168320, argContext=<optimized out>, expectedDesc=0x2168ef0, randomAccess=0 '\000')
    at /home/andres/src/postgresql/src/backend/executor/execQual.c:2211
#6  0x0000000000616992 in FunctionNext (node=node@entry=0x2168210) at /home/andres/src/postgresql/src/backend/executor/nodeFunctionscan.c:94
#7  0x00000000005ffdcb in ExecScanFetch (recheckMtd=0x6166f0 <FunctionRecheck>, accessMtd=0x616700 <FunctionNext>, node=0x2168210)
    at /home/andres/src/postgresql/src/backend/executor/execScan.c:95
#8  ExecScan (node=node@entry=0x2168210, accessMtd=accessMtd@entry=0x616700 <FunctionNext>, recheckMtd=recheckMtd@entry=0x6166f0 <FunctionRecheck>)
    at /home/andres/src/postgresql/src/backend/executor/execScan.c:145
#9  0x00000000006169e4 in ExecFunctionScan (node=node@entry=0x2168210) at /home/andres/src/postgresql/src/backend/executor/nodeFunctionscan.c:268

the error happened just after I restarted a standby, so it's not
unlikely to be related to hot_standby_feedback.

(gdb) p *tuple.t_data
$5 = {t_choice = {t_heap = {t_xmin = 9105470, t_xmax = 26049273, t_field3 = {t_cid = 0, t_xvac = 0}},
  t_datum = {datum_len_ = 9105470, datum_typmod = 26049273, datum_typeid = 0}},
  t_ctid = {ip_blkid = {bi_hi = 1, bi_lo = 19765}, ip_posid = 3},
  t_infomask2 = 4, t_infomask = 770, t_hoff = 24 '\030', t_bits = 0x7f9fb8681c17 ""}

Infomask is:
#define HEAP_XMIN_COMMITTED	0x0100	/* t_xmin committed */
#define HEAP_XMIN_INVALID	0x0200	/* t_xmin invalid/aborted */
#define HEAP_XMIN_FROZEN	(HEAP_XMIN_COMMITTED|HEAP_XMIN_INVALID)
#define HEAP_HASVARWIDTH	0x0002	/* has variable-width attribute(s) */

This indeed looks borked. Such a tuple should never survive

	if (check_frozen && !VM_ALL_FROZEN(rel, blkno, &vmbuffer))
		check_frozen = false;

especially not when

(gdb) p PageIsAllVisible(page)
$3 = 4

(fwiw, checking PD_ALL_VISIBLE in those functions sounds like a good plan)

I've got another earlier case (that I somehow missed seeing), below
check_visible:

(gdb) p *tuple->t_data
$2 = {t_choice = {t_heap = {t_xmin = 13616549, t_xmax = 25210801, t_field3 = {t_cid = 0, t_xvac = 0}},
  t_datum = {datum_len_ = 13616549, datum_typmod = 25210801, datum_typeid = 0}},
  t_ctid = {ip_blkid = {bi_hi = 0, bi_lo = 52320}, ip_posid = 67},
  t_infomask2 = 32772, t_infomask = 8962, t_hoff = 24 '\030', t_bits = 0x7f9fda2f8717 ""}

infomask is:
#define HEAP_UPDATED		0x2000	/* this is UPDATEd version of row */
#define HEAP_XMIN_COMMITTED	0x0100	/* t_xmin committed */
#define HEAP_XMIN_INVALID	0x0200	/* t_xmin invalid/aborted */
#define HEAP_HASVARWIDTH	0x0002	/* has variable-width attribute(s) */

infomask2 is:
#define HEAP_ONLY_TUPLE		0x8000	/* this is heap-only tuple */

I'll run again, with a debugger attached, maybe I can get some more
information.

Regards,

Andres
On Fri, Jun 10, 2016 at 1:59 PM, Andres Freund <andres@anarazel.de> wrote: >> Master, but with an existing standby. So it could be related to >> hot_standby_feedback or such. > > I just managed to trigger it again. > > > #1 0x00007fa1a73778da in __GI_abort () at abort.c:89 > #2 0x00007f9f1395e59c in record_corrupt_item (items=items@entry=0x2137be0, tid=0x7f9fb8681c0c) > at /home/andres/src/postgresql/contrib/pg_visibility/pg_visibility.c:612 > #3 0x00007f9f1395ead5 in collect_corrupt_items (relid=relid@entry=29449, all_visible=all_visible@entry=0 '\000', all_frozen=all_frozen@entry=1'\001') > at /home/andres/src/postgresql/contrib/pg_visibility/pg_visibility.c:572 > #4 0x00007f9f1395f476 in pg_check_frozen (fcinfo=0x7ffe5343a200) at /home/andres/src/postgresql/contrib/pg_visibility/pg_visibility.c:292 > #5 0x00000000005fdbec in ExecMakeTableFunctionResult (funcexpr=0x2168630, econtext=0x2168320, argContext=<optimized out>,expectedDesc=0x2168ef0, > randomAccess=0 '\000') at /home/andres/src/postgresql/src/backend/executor/execQual.c:2211 > #6 0x0000000000616992 in FunctionNext (node=node@entry=0x2168210) at /home/andres/src/postgresql/src/backend/executor/nodeFunctionscan.c:94 > #7 0x00000000005ffdcb in ExecScanFetch (recheckMtd=0x6166f0 <FunctionRecheck>, accessMtd=0x616700 <FunctionNext>, node=0x2168210) > at /home/andres/src/postgresql/src/backend/executor/execScan.c:95 > #8 ExecScan (node=node@entry=0x2168210, accessMtd=accessMtd@entry=0x616700 <FunctionNext>, recheckMtd=recheckMtd@entry=0x6166f0<FunctionRecheck>) > at /home/andres/src/postgresql/src/backend/executor/execScan.c:145 > #9 0x00000000006169e4 in ExecFunctionScan (node=node@entry=0x2168210) at /home/andres/src/postgresql/src/backend/executor/nodeFunctionscan.c:268 > > the error happened just after I restarted a standby, so it's not > unlikely to be related to hot_standby_feedback. After some off-list discussion and debugging, Andres and I have managed to identify three issues here (so far). Two are issues in the testing, and one is a data-corrupting bug in the freeze map code. 1. pg_check_visible keeps on using the same OldestXmin for all its checks even though the real OldestXmin may advance in the meantime. This can lead to spurious problem reports: pg_check_visible() thinks that the tuple isn't all visible yet and reports it as corruption, but in reality there's no problem. 2. pg_check_visible includes the same check for heap-xmin-committed that vacuumlazy.c uses, but hint bits aren't crash safe, so this could lead to a spurious trouble report in a scenario involving a crash. 3. vacuumlazy.c includes this code: if (heap_prepare_freeze_tuple(tuple.t_data, FreezeLimit, MultiXactCutoff,&frozen[nfrozen])) frozen[nfrozen++].offset = offnum; else if (heap_tuple_needs_eventual_freeze(tuple.t_data)) all_frozen = false; That's wrong, because a "true" return value from heap_prepare_freeze_tuple() means only that it has done *some* freezing work on the tuple, not that it's done all of the freezing work that will ever need to be done. So, if the tuple's xmin can be frozen and is aborted but not older than vacuum_freeze_min_age, then heap_prepare_freeze_tuple() won't free xmax, but the page will still be marked all-frozen, which is bad. I think it normally won't matter because the xmax will probably be hinted invalid anyway, since we just pruned the page which should have set hint bits everywhere, but if those hint bits were lost then we'd eventually end up with an accessible xmax pointing off into space. 
My first thought was to just delete the "else" but that would be bad because we'd fail to set all-frozen immediately in a lot of cases where we should. This needs a bit more thought than I have time to give it right now. (I will update on the status of this open item again no later than Monday; probably sooner.) -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Robert Haas wrote: > 3. vacuumlazy.c includes this code: > > if (heap_prepare_freeze_tuple(tuple.t_data, FreezeLimit, > MultiXactCutoff, &frozen[nfrozen])) > frozen[nfrozen++].offset = offnum; > else if (heap_tuple_needs_eventual_freeze(tuple.t_data)) > all_frozen = false; > > That's wrong, because a "true" return value from > heap_prepare_freeze_tuple() means only that it has done *some* > freezing work on the tuple, not that it's done all of the freezing > work that will ever need to be done. So, if the tuple's xmin can be > frozen and is aborted but not older than vacuum_freeze_min_age, then > heap_prepare_freeze_tuple() won't free xmax, but the page will still > be marked all-frozen, which is bad. I think it normally won't matter > because the xmax will probably be hinted invalid anyway, since we just > pruned the page which should have set hint bits everywhere, but if > those hint bits were lost then we'd eventually end up with an > accessible xmax pointing off into space. Good catch. Also consider multixact freezing: if there is a long-running transaction which is a lock-only member of tuple's Xmax, and the multixact needs freezing because it's older than the multixact cutoff, we set the xmax to a new multixact which includes that old locker. See FreezeMultiXactId. > My first thought was to just delete the "else" but that would be bad > because we'd fail to set all-frozen immediately in a lot of cases > where we should. This needs a bit more thought than I have time to > give it right now. How about changing the return tuple of heap_prepare_freeze_tuple to a bitmap? Two flags: "Freeze [not] done" and "[No] more freezing needed" -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Fri, Jun 10, 2016 at 4:55 PM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote: > Robert Haas wrote: >> 3. vacuumlazy.c includes this code: >> >> if (heap_prepare_freeze_tuple(tuple.t_data, FreezeLimit, >> MultiXactCutoff, &frozen[nfrozen])) >> frozen[nfrozen++].offset = offnum; >> else if (heap_tuple_needs_eventual_freeze(tuple.t_data)) >> all_frozen = false; >> >> That's wrong, because a "true" return value from >> heap_prepare_freeze_tuple() means only that it has done *some* >> freezing work on the tuple, not that it's done all of the freezing >> work that will ever need to be done. So, if the tuple's xmin can be >> frozen and is aborted but not older than vacuum_freeze_min_age, then >> heap_prepare_freeze_tuple() won't free xmax, but the page will still >> be marked all-frozen, which is bad. I think it normally won't matter >> because the xmax will probably be hinted invalid anyway, since we just >> pruned the page which should have set hint bits everywhere, but if >> those hint bits were lost then we'd eventually end up with an >> accessible xmax pointing off into space. > > Good catch. Also consider multixact freezing: if there is a > long-running transaction which is a lock-only member of tuple's Xmax, > and the multixact needs freezing because it's older than the multixact > cutoff, we set the xmax to a new multixact which includes that old > locker. See FreezeMultiXactId. > >> My first thought was to just delete the "else" but that would be bad >> because we'd fail to set all-frozen immediately in a lot of cases >> where we should. This needs a bit more thought than I have time to >> give it right now. > > How about changing the return tuple of heap_prepare_freeze_tuple to > a bitmap? Two flags: "Freeze [not] done" and "[No] more freezing > needed" Yes, I think something like that sounds about right. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Sat, Jun 11, 2016 at 1:24 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>
> 3. vacuumlazy.c includes this code:
>
> if (heap_prepare_freeze_tuple(tuple.t_data, FreezeLimit,
> MultiXactCutoff, &frozen[nfrozen]))
> frozen[nfrozen++].offset = offnum;
> else if (heap_tuple_needs_eventual_freeze(tuple.t_data))
> all_frozen = false;
>
> That's wrong, because a "true" return value from
> heap_prepare_freeze_tuple() means only that it has done *some*
> freezing work on the tuple, not that it's done all of the freezing
> work that will ever need to be done. So, if the tuple's xmin can be
> frozen and is aborted but not older than vacuum_freeze_min_age, then
> heap_prepare_freeze_tuple() won't free xmax, but the page will still
> be marked all-frozen, which is bad.
To clarify, are you talking about a case where the insertion has aborted? Won't the all_visible flag be set to false in such a case, due to the return value from HeapTupleSatisfiesVacuum(), and if so, shouldn't the later code refrain from marking the page as all_frozen?
On Sat, Jun 11, 2016 at 5:00 PM, Robert Haas <robertmhaas@gmail.com> wrote: >> How about changing the return tuple of heap_prepare_freeze_tuple to >> a bitmap? Two flags: "Freeze [not] done" and "[No] more freezing >> needed" > > Yes, I think something like that sounds about right. Here's a patch. I took the approach of adding a separate bool out parameter instead. I am also attaching an update of the check-visibility patch which responds to assorted review comments and adjusting it for the problems found on Friday which could otherwise lead to false positives. I'm still getting occasional TIDs from the pg_check_visible() function during pgbench runs, though, so evidently not all is well with the world. (Official status update: I'm hoping that senior hackers will carefully review these patches for defects. If they do not, I plan to commit the patches anyway neither less than 48 nor more than 60 hours from now after re-reviewing them myself.) -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
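With the separate bool out parameter, the problematic vacuumlazy.c fragment quoted upthread would take roughly this shape; a sketch of the approach rather than the committed diff, with the parameter name illustrative:

	bool		tuple_totally_frozen;

	/*
	 * The return value still means "some freezing was performed"; the new
	 * out parameter separately reports whether the tuple will never need
	 * any further freezing.
	 */
	if (heap_prepare_freeze_tuple(tuple.t_data, FreezeLimit,
								  MultiXactCutoff, &frozen[nfrozen],
								  &tuple_totally_frozen))
		frozen[nfrozen++].offset = offnum;

	if (!tuple_totally_frozen)
		all_frozen = false;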
On June 13, 2016 11:02:42 AM CDT, Robert Haas <robertmhaas@gmail.com> wrote: >(Official status update: I'm hoping that senior hackers will carefully >review these patches for defects. If they do not, I plan to commit >the patches anyway neither less than 48 nor more than 60 hours from >now after re-reviewing them myself.) I'm traveling today and tomorrow, but will look after that. Andres -- Sent from my Android device with K-9 Mail. Please excuse my brevity.
On Tue, Jun 14, 2016 at 4:02 AM, Robert Haas <robertmhaas@gmail.com> wrote: > On Sat, Jun 11, 2016 at 5:00 PM, Robert Haas <robertmhaas@gmail.com> wrote: >>> How about changing the return tuple of heap_prepare_freeze_tuple to >>> a bitmap? Two flags: "Freeze [not] done" and "[No] more freezing >>> needed" >> >> Yes, I think something like that sounds about right. > > Here's a patch. I took the approach of adding a separate bool out > parameter instead. I am also attaching an update of the > check-visibility patch which responds to assorted review comments and > adjusting it for the problems found on Friday which could otherwise > lead to false positives. I'm still getting occasional TIDs from the > pg_check_visible() function during pgbench runs, though, so evidently > not all is well with the world. I'm still working out how half this stuff works, but I managed to get pg_check_visible() to spit out a row every few seconds with the following brute force approach: CREATE TABLE foo (n int); INSERT INTO foo SELECT generate_series(1, 100000); Three client threads (see attached script): 1. Run VACUUM in a tight loop. 2. Run UPDATE foo SET n = n + 1 in a tight loop. 3. Run SELECT pg_check_visible('foo'::regclass) in a tight loop, and print out any rows it produces. I noticed that the tuples that it reported were always offset 1 in a page, and that the page always had a maxoff over a couple of hundred, and that we called record_corrupt_item because VM_ALL_VISIBLE returned true but HeapTupleSatisfiesVacuum on the first tuple returned HEAPTUPLE_DELETE_IN_PROGRESS instead of the expected HEAPTUPLE_LIVE. It did that because HEAP_XMAX_COMMITTED was not set and TransactionIdIsInProgress returned true for xmax. Not sure how much of this was already obvious! I will poke at it some more tomorrow. -- Thomas Munro http://www.enterprisedb.com
On Tue, Jun 14, 2016 at 2:53 AM, Thomas Munro <thomas.munro@enterprisedb.com> wrote: > On Tue, Jun 14, 2016 at 4:02 AM, Robert Haas <robertmhaas@gmail.com> wrote: >> On Sat, Jun 11, 2016 at 5:00 PM, Robert Haas <robertmhaas@gmail.com> wrote: >>>> How about changing the return tuple of heap_prepare_freeze_tuple to >>>> a bitmap? Two flags: "Freeze [not] done" and "[No] more freezing >>>> needed" >>> >>> Yes, I think something like that sounds about right. >> >> Here's a patch. I took the approach of adding a separate bool out >> parameter instead. I am also attaching an update of the >> check-visibility patch which responds to assorted review comments and >> adjusting it for the problems found on Friday which could otherwise >> lead to false positives. I'm still getting occasional TIDs from the >> pg_check_visible() function during pgbench runs, though, so evidently >> not all is well with the world. > > I'm still working out how half this stuff works, but I managed to get > pg_check_visible() to spit out a row every few seconds with the > following brute force approach: > > CREATE TABLE foo (n int); > INSERT INTO foo SELECT generate_series(1, 100000); > > Three client threads (see attached script): > 1. Run VACUUM in a tight loop. > 2. Run UPDATE foo SET n = n + 1 in a tight loop. > 3. Run SELECT pg_check_visible('foo'::regclass) in a tight loop, and > print out any rows it produces. > > I noticed that the tuples that it reported were always offset 1 in a > page, and that the page always had a maxoff over a couple of hundred, > and that we called record_corrupt_item because VM_ALL_VISIBLE returned > true but HeapTupleSatisfiesVacuum on the first tuple returned > HEAPTUPLE_DELETE_IN_PROGRESS instead of the expected HEAPTUPLE_LIVE. > It did that because HEAP_XMAX_COMMITTED was not set and > TransactionIdIsInProgress returned true for xmax. So this seems like it might be a visibility map bug rather than a bug in the test code, but I'm not completely sure of that. How was it legitimate to mark the page as all-visible if a tuple on the page still had a live xmax? If xmax is live and not just a locker then the tuple is not visible to the transaction that wrote xmax, at least. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Tue, Jun 14, 2016 at 8:08 AM, Robert Haas <robertmhaas@gmail.com> wrote: > On Tue, Jun 14, 2016 at 2:53 AM, Thomas Munro > <thomas.munro@enterprisedb.com> wrote: >> On Tue, Jun 14, 2016 at 4:02 AM, Robert Haas <robertmhaas@gmail.com> wrote: >>> On Sat, Jun 11, 2016 at 5:00 PM, Robert Haas <robertmhaas@gmail.com> wrote: >>>>> How about changing the return tuple of heap_prepare_freeze_tuple to >>>>> a bitmap? Two flags: "Freeze [not] done" and "[No] more freezing >>>>> needed" >>>> >>>> Yes, I think something like that sounds about right. >>> >>> Here's a patch. I took the approach of adding a separate bool out >>> parameter instead. I am also attaching an update of the >>> check-visibility patch which responds to assorted review comments and >>> adjusting it for the problems found on Friday which could otherwise >>> lead to false positives. I'm still getting occasional TIDs from the >>> pg_check_visible() function during pgbench runs, though, so evidently >>> not all is well with the world. >> >> I'm still working out how half this stuff works, but I managed to get >> pg_check_visible() to spit out a row every few seconds with the >> following brute force approach: >> >> CREATE TABLE foo (n int); >> INSERT INTO foo SELECT generate_series(1, 100000); >> >> Three client threads (see attached script): >> 1. Run VACUUM in a tight loop. >> 2. Run UPDATE foo SET n = n + 1 in a tight loop. >> 3. Run SELECT pg_check_visible('foo'::regclass) in a tight loop, and >> print out any rows it produces. >> >> I noticed that the tuples that it reported were always offset 1 in a >> page, and that the page always had a maxoff over a couple of hundred, >> and that we called record_corrupt_item because VM_ALL_VISIBLE returned >> true but HeapTupleSatisfiesVacuum on the first tuple returned >> HEAPTUPLE_DELETE_IN_PROGRESS instead of the expected HEAPTUPLE_LIVE. >> It did that because HEAP_XMAX_COMMITTED was not set and >> TransactionIdIsInProgress returned true for xmax. > > So this seems like it might be a visibility map bug rather than a bug > in the test code, but I'm not completely sure of that. How was it > legitimate to mark the page as all-visible if a tuple on the page > still had a live xmax? If xmax is live and not just a locker then the > tuple is not visible to the transaction that wrote xmax, at least. Ah, wait a minute. I see how this could happen. Hang on, let me update the pg_visibility patch. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Tue, Jun 14, 2016 at 8:11 AM, Robert Haas <robertmhaas@gmail.com> wrote: >>> I noticed that the tuples that it reported were always offset 1 in a >>> page, and that the page always had a maxoff over a couple of hundred, >>> and that we called record_corrupt_item because VM_ALL_VISIBLE returned >>> true but HeapTupleSatisfiesVacuum on the first tuple returned >>> HEAPTUPLE_DELETE_IN_PROGRESS instead of the expected HEAPTUPLE_LIVE. >>> It did that because HEAP_XMAX_COMMITTED was not set and >>> TransactionIdIsInProgress returned true for xmax. >> >> So this seems like it might be a visibility map bug rather than a bug >> in the test code, but I'm not completely sure of that. How was it >> legitimate to mark the page as all-visible if a tuple on the page >> still had a live xmax? If xmax is live and not just a locker then the >> tuple is not visible to the transaction that wrote xmax, at least. > > Ah, wait a minute. I see how this could happen. Hang on, let me > update the pg_visibility patch. The problem should be fixed in the attached revision of pg_check_visible. I think what happened is: 1. pg_check_visible computed an OldestXmin. 2. Some transaction committed. 3. VACUUM computed a newer OldestXmin and marked a page all-visible with it. 4. pg_check_visible then used its older OldestXmin to check the visibility of tuples on that page, and saw delete-in-progress as a result. I added a guard against a similar scenario involving xmin in the last version of this patch, but forgot that we need to protect xmax in the same way. With this version of the patch, I can no longer get any TIDs to pop up out of pg_check_visible in my testing. (I haven't run your test script for lack of the proper Python environment...) -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Attachment
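For illustration, the recheck Robert describes might look roughly like this (a minimal sketch, not the actual pg_visibility code; record_corrupt_item is the helper mentioned upthread, with an assumed signature):

/*
 * Sketch: if a tuple on an all-visible page is not HEAPTUPLE_LIVE under
 * the OldestXmin computed at the start of the scan, recompute the horizon
 * and retest before reporting corruption, since VACUUM may have marked
 * the page all-visible using a newer horizon.
 */
if (HeapTupleSatisfiesVacuum(&tuple, OldestXmin, buffer) != HEAPTUPLE_LIVE)
{
    TransactionId RecomputedOldestXmin = GetOldestXmin(NULL, true);

    if (!TransactionIdPrecedes(OldestXmin, RecomputedOldestXmin))
        record_corrupt_item(items, &tuple.t_self);  /* horizon unchanged */
    else
    {
        OldestXmin = RecomputedOldestXmin;
        if (HeapTupleSatisfiesVacuum(&tuple, OldestXmin, buffer)
            != HEAPTUPLE_LIVE)
            record_corrupt_item(items, &tuple.t_self);
    }
}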
On Wed, Jun 15, 2016 at 12:44 AM, Robert Haas <robertmhaas@gmail.com> wrote: > On Tue, Jun 14, 2016 at 8:11 AM, Robert Haas <robertmhaas@gmail.com> wrote: >>>> I noticed that the tuples that it reported were always offset 1 in a >>>> page, and that the page always had a maxoff over a couple of hundred, >>>> and that we called record_corrupt_item because VM_ALL_VISIBLE returned >>>> true but HeapTupleSatisfiesVacuum on the first tuple returned >>>> HEAPTUPLE_DELETE_IN_PROGRESS instead of the expected HEAPTUPLE_LIVE. >>>> It did that because HEAP_XMAX_COMMITTED was not set and >>>> TransactionIdIsInProgress returned true for xmax. >>> >>> So this seems like it might be a visibility map bug rather than a bug >>> in the test code, but I'm not completely sure of that. How was it >>> legitimate to mark the page as all-visible if a tuple on the page >>> still had a live xmax? If xmax is live and not just a locker then the >>> tuple is not visible to the transaction that wrote xmax, at least. >> >> Ah, wait a minute. I see how this could happen. Hang on, let me >> update the pg_visibility patch. > > The problem should be fixed in the attached revision of > pg_check_visible. I think what happened is: > > 1. pg_check_visible computed an OldestXmin. > 2. Some transaction committed. > 3. VACUUM computed a newer OldestXmin and marked a page all-visible with it. > 4. pg_check_visible then used its older OldestXmin to check the > visibility of tuples on that page, and saw delete-in-progress as a > result. > > I added a guard against a similar scenario involving xmin in the last > version of this patch, but forgot that we need to protect xmax in the > same way. With this version of the patch, I can no longer get any > TIDs to pop up out of pg_check_visible in my testing. (I haven't run > your test script for lack of the proper Python environment...) I can still reproduce the problem with this new patch. What I see is that the OldestXmin, the new RecomputedOldestXmin and the tuple's xmax are all the same. -- Thomas Munro http://www.enterprisedb.com
On Wed, Jun 15, 2016 at 11:43 AM, Thomas Munro <thomas.munro@enterprisedb.com> wrote: > On Wed, Jun 15, 2016 at 12:44 AM, Robert Haas <robertmhaas@gmail.com> wrote: >> On Tue, Jun 14, 2016 at 8:11 AM, Robert Haas <robertmhaas@gmail.com> wrote: >>>>> I noticed that the tuples that it reported were always offset 1 in a >>>>> page, and that the page always had a maxoff over a couple of hundred, >>>>> and that we called record_corrupt_item because VM_ALL_VISIBLE returned >>>>> true but HeapTupleSatisfiesVacuum on the first tuple returned >>>>> HEAPTUPLE_DELETE_IN_PROGRESS instead of the expected HEAPTUPLE_LIVE. >>>>> It did that because HEAP_XMAX_COMMITTED was not set and >>>>> TransactionIdIsInProgress returned true for xmax. >>>> >>>> So this seems like it might be a visibility map bug rather than a bug >>>> in the test code, but I'm not completely sure of that. How was it >>>> legitimate to mark the page as all-visible if a tuple on the page >>>> still had a live xmax? If xmax is live and not just a locker then the >>>> tuple is not visible to the transaction that wrote xmax, at least. >>> >>> Ah, wait a minute. I see how this could happen. Hang on, let me >>> update the pg_visibility patch. >> >> The problem should be fixed in the attached revision of >> pg_check_visible. I think what happened is: >> >> 1. pg_check_visible computed an OldestXmin. >> 2. Some transaction committed. >> 3. VACUUM computed a newer OldestXmin and marked a page all-visible with it. >> 4. pg_check_visible then used its older OldestXmin to check the >> visibility of tuples on that page, and saw delete-in-progress as a >> result. >> >> I added a guard against a similar scenario involving xmin in the last >> version of this patch, but forgot that we need to protect xmax in the >> same way. With this version of the patch, I can no longer get any >> TIDs to pop up out of pg_check_visible in my testing. (I haven't run >> your test script for lack of the proper Python environment...) > > I can still reproduce the problem with this new patch. What I see is > that the OldestXmin, the new RecomputedOldestXmin and the tuple's xmax > are all the same. I spent some time chasing down the exact circumstances. I suspect that there may be an interlocking problem in heap_update. Using the line numbers from cae1c788 [1], I see the following interaction between the VACUUM, UPDATE and SELECT (pg_check_visible) backends, all in reference to the same block number:

[VACUUM] sets all visible bit

[UPDATE] heapam.c:3931 HeapTupleHeaderSetXmax(oldtup.t_data, xmax_old_tuple);
[UPDATE] heapam.c:3938 LockBuffer(buffer, BUFFER_LOCK_UNLOCK);

[SELECT] LockBuffer(buffer, BUFFER_LOCK_SHARE);
[SELECT] observes VM_ALL_VISIBLE as true
[SELECT] observes tuple in HEAPTUPLE_DELETE_IN_PROGRESS state
[SELECT] barfs

[UPDATE] heapam.c:4116 visibilitymap_clear(...)

[1] https://git.postgresql.org/gitweb/?p=postgresql.git;a=blob;f=src/backend/access/heap/heapam.c;hb=cae1c788b9b43887e4a4fa51a11c3a8ffa334939 -- Thomas Munro http://www.enterprisedb.com
On Wed, Jun 15, 2016 at 2:41 AM, Thomas Munro <thomas.munro@enterprisedb.com> wrote: > I spent some time chasing down the exact circumstances. I suspect > that there may be an interlocking problem in heap_update. Using the > line numbers from cae1c788 [1], I see the following interaction > between the VACUUM, UPDATE and SELECT (pg_check_visible) backends, all > in reference to the same block number: > > [VACUUM] sets all visible bit > > [UPDATE] heapam.c:3931 HeapTupleHeaderSetXmax(oldtup.t_data, xmax_old_tuple); > [UPDATE] heapam.c:3938 LockBuffer(buffer, BUFFER_LOCK_UNLOCK); > > [SELECT] LockBuffer(buffer, BUFFER_LOCK_SHARE); > [SELECT] observes VM_ALL_VISIBLE as true > [SELECT] observes tuple in HEAPTUPLE_DELETE_IN_PROGRESS state > [SELECT] barfs > > [UPDATE] heapam.c:4116 visibilitymap_clear(...) Yikes: heap_update() sets the tuple's XMAX, CMAX, infomask, infomask2, and CTID without logging anything or clearing the all-visible flag and then releases the lock on the heap page to go do some more work that might even ERROR out. Only if that other work all goes OK do we relock the page and perform the WAL-logged actions. That doesn't seem like a good idea even in existing releases, because you've taken a tuple on an all-visible page and made it not all-visible, and you've made a page modification that is not necessarily atomic without logging it. This is particularly bad in 9.6, because if that page is also all-frozen then XMAX will eventually be pointing into space and VACUUM will never visit the page to re-freeze it the way it would have done in earlier releases. However, even in older releases, I think there's a remote possibility of data corruption. Backend #1 makes these changes to the page, releases the lock, and errors out. Backend #2 writes the page to the OS. DBA takes a hot backup, tearing the page in the middle of XMAX. Oops. I'm not sure what to do about this: this part of the heap_update() logic has been like this forever, and I assume that if it were easy to refactor this away, somebody would have done it by now. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
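For readers following along, the sequence being described is approximately this (heavily abridged from heap_update() as of cae1c788; a sketch, not the literal code):

/* old page is exclusively locked; visibility checks already done */
HeapTupleHeaderSetXmax(oldtup.t_data, xmax_old_tuple); /* page modified... */
/* ...along with CMAX, infomask, infomask2 and CTID */
LockBuffer(buffer, BUFFER_LOCK_UNLOCK);     /* unlocked: nothing WAL-logged,
                                             * all-visible still set */
/* TOAST work and target-page selection follow; both can ERROR out */
heaptup = toast_insert_or_update(relation, newtup, &oldtup, 0);
newbuf = RelationGetBufferForTuple(relation, heaptup->t_len,
                                   buffer, 0, NULL,
                                   &vmbuffer_new, &vmbuffer);
LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
/* only now: PageClearAllVisible(), visibilitymap_clear(), XLogInsert() */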
On Wed, Jun 15, 2016 at 6:26 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Wed, Jun 15, 2016 at 2:41 AM, Thomas Munro
> <thomas.munro@enterprisedb.com> wrote:
> > I spent some time chasing down the exact circumstances. I suspect
> > that there may be an interlocking problem in heap_update. Using the
> > line numbers from cae1c788 [1], I see the following interaction
> > between the VACUUM, UPDATE and SELECT (pg_check_visible) backends, all
> > in reference to the same block number:
> >
> > [VACUUM] sets all visible bit
> >
> > [UPDATE] heapam.c:3931 HeapTupleHeaderSetXmax(oldtup.t_data, xmax_old_tuple);
> > [UPDATE] heapam.c:3938 LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
> >
> > [SELECT] LockBuffer(buffer, BUFFER_LOCK_SHARE);
> > [SELECT] observes VM_ALL_VISIBLE as true
> > [SELECT] observes tuple in HEAPTUPLE_DELETE_IN_PROGRESS state
> > [SELECT] barfs
> >
> > [UPDATE] heapam.c:4116 visibilitymap_clear(...)
>
> Yikes: heap_update() sets the tuple's XMAX, CMAX, infomask, infomask2,
> and CTID without logging anything or clearing the all-visible flag and
> then releases the lock on the heap page to go do some more work that
> might even ERROR out.
>
Can't we clear the all-visible flag before releasing the lock? We can use the already_marked logic, as it is currently used in the code, to clear it just once.
On Wed, Jun 15, 2016 at 9:43 AM, Amit Kapila <amit.kapila16@gmail.com> wrote: > On Wed, Jun 15, 2016 at 6:26 PM, Robert Haas <robertmhaas@gmail.com> wrote: >> On Wed, Jun 15, 2016 at 2:41 AM, Thomas Munro >> <thomas.munro@enterprisedb.com> wrote: >> > I spent some time chasing down the exact circumstances. I suspect >> > that there may be an interlocking problem in heap_update. Using the >> > line numbers from cae1c788 [1], I see the following interaction >> > between the VACUUM, UPDATE and SELECT (pg_check_visible) backends, all >> > in reference to the same block number: >> > >> > [VACUUM] sets all visible bit >> > >> > [UPDATE] heapam.c:3931 HeapTupleHeaderSetXmax(oldtup.t_data, >> > xmax_old_tuple); >> > [UPDATE] heapam.c:3938 LockBuffer(buffer, BUFFER_LOCK_UNLOCK); >> > >> > [SELECT] LockBuffer(buffer, BUFFER_LOCK_SHARE); >> > [SELECT] observes VM_ALL_VISIBLE as true >> > [SELECT] observes tuple in HEAPTUPLE_DELETE_IN_PROGRESS state >> > [SELECT] barfs >> > >> > [UPDATE] heapam.c:4116 visibilitymap_clear(...) >> >> Yikes: heap_update() sets the tuple's XMAX, CMAX, infomask, infomask2, >> and CTID without logging anything or clearing the all-visible flag and >> then releases the lock on the heap page to go do some more work that >> might even ERROR out. > > Can't we clear the all-visible flag before releasing the lock? We can use > logic of already_marked as it is currently used in code to clear it just > once. That just kicks the can down the road. Then you have PD_ALL_VISIBLE clear but the VM bit is still set. And you still haven't WAL-logged anything. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Wed, Jun 15, 2016 at 7:13 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Wed, Jun 15, 2016 at 9:43 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> > On Wed, Jun 15, 2016 at 6:26 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> >> On Wed, Jun 15, 2016 at 2:41 AM, Thomas Munro
> >> <thomas.munro@enterprisedb.com> wrote:
> >> > I spent some time chasing down the exact circumstances. I suspect
> >> > that there may be an interlocking problem in heap_update. Using the
> >> > line numbers from cae1c788 [1], I see the following interaction
> >> > between the VACUUM, UPDATE and SELECT (pg_check_visible) backends, all
> >> > in reference to the same block number:
> >> >
> >> > [VACUUM] sets all visible bit
> >> >
> >> > [UPDATE] heapam.c:3931 HeapTupleHeaderSetXmax(oldtup.t_data,
> >> > xmax_old_tuple);
> >> > [UPDATE] heapam.c:3938 LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
> >> >
> >> > [SELECT] LockBuffer(buffer, BUFFER_LOCK_SHARE);
> >> > [SELECT] observes VM_ALL_VISIBLE as true
> >> > [SELECT] observes tuple in HEAPTUPLE_DELETE_IN_PROGRESS state
> >> > [SELECT] barfs
> >> >
> >> > [UPDATE] heapam.c:4116 visibilitymap_clear(...)
> >>
> >> Yikes: heap_update() sets the tuple's XMAX, CMAX, infomask, infomask2,
> >> and CTID without logging anything or clearing the all-visible flag and
> >> then releases the lock on the heap page to go do some more work that
> >> might even ERROR out.
> >
> > Can't we clear the all-visible flag before releasing the lock? We can use
> > logic of already_marked as it is currently used in code to clear it just
> > once.
>
> That just kicks the can down the road. Then you have PD_ALL_VISIBLE
> clear but the VM bit is still set.
I mean to say clear both as we are doing currently in code:
if (PageIsAllVisible(BufferGetPage(buffer)))
{
    all_visible_cleared = true;
    PageClearAllVisible(BufferGetPage(buffer));
    visibilitymap_clear(relation, BufferGetBlockNumber(buffer),
                        vmbuffer);
}
>
> And you still haven't WAL-logged
> anything.
>
Yeah, I think the WAL requirement is more difficult to meet: releasing the lock on the buffer before writing WAL could allow such a buffer to be flushed to disk before its WAL record.
I feel this is an existing bug and should go to the Older Bugs section of the open items page.
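For context, the convention Amit is referring to (per src/backend/access/transam/README) is that a WAL-logged page change keeps the buffer exclusively locked from the modification through PageSetLSN, so the buffer manager cannot write the page ahead of its WAL record. A minimal sketch of the standard pattern:

START_CRIT_SECTION();
/* ... modify the page ... */
MarkBufferDirty(buffer);
if (RelationNeedsWAL(relation))
{
    XLogRecPtr  recptr;

    XLogBeginInsert();
    XLogRegisterBuffer(0, buffer, REGBUF_STANDARD);
    /* ... register the record data ... */
    recptr = XLogInsert(RM_HEAP_ID, XLOG_HEAP_UPDATE);
    PageSetLSN(BufferGetPage(buffer), recptr);
}
END_CRIT_SECTION();
LockBuffer(buffer, BUFFER_LOCK_UNLOCK);     /* safe to unlock only now */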
On Wed, Jun 15, 2016 at 9:56 PM, Robert Haas <robertmhaas@gmail.com> wrote: > On Wed, Jun 15, 2016 at 2:41 AM, Thomas Munro > <thomas.munro@enterprisedb.com> wrote: >> I spent some time chasing down the exact circumstances. I suspect >> that there may be an interlocking problem in heap_update. Using the >> line numbers from cae1c788 [1], I see the following interaction >> between the VACUUM, UPDATE and SELECT (pg_check_visible) backends, all >> in reference to the same block number: >> >> [VACUUM] sets all visible bit >> >> [UPDATE] heapam.c:3931 HeapTupleHeaderSetXmax(oldtup.t_data, xmax_old_tuple); >> [UPDATE] heapam.c:3938 LockBuffer(buffer, BUFFER_LOCK_UNLOCK); >> >> [SELECT] LockBuffer(buffer, BUFFER_LOCK_SHARE); >> [SELECT] observes VM_ALL_VISIBLE as true >> [SELECT] observes tuple in HEAPTUPLE_DELETE_IN_PROGRESS state >> [SELECT] barfs >> >> [UPDATE] heapam.c:4116 visibilitymap_clear(...) > > Yikes: heap_update() sets the tuple's XMAX, CMAX, infomask, infomask2, > and CTID without logging anything or clearing the all-visible flag and > then releases the lock on the heap page to go do some more work that > might even ERROR out. Only if that other work all goes OK do we > relock the page and perform the WAL-logged actions. > > That doesn't seem like a good idea even in existing releases, because > you've taken a tuple on an all-visible page and made it not > all-visible, and you've made a page modification that is not > necessarily atomic without logging it. This is is particularly bad in > 9.6, because if that page is also all-frozen then XMAX will eventually > be pointing into space and VACUUM will never visit the page to > re-freeze it the way it would have done in earlier releases. However, > even in older releases, I think there's a remote possibility of data > corruption. Backend #1 makes these changes to the page, releases the > lock, and errors out. Backend #2 writes the page to the OS. DBA > takes a hot backup, tearing the page in the middle of XMAX. Oops. > > I'm not sure what to do about this: this part of the heap_update() > logic has been like this forever, and I assume that if it were easy to > refactor this away, somebody would have done it by now. > How about changing collect_corrupt_items to acquire AccessExclusiveLock for safely checking? Regards, -- Masahiko Sawada
On Wed, Jun 15, 2016 at 10:03 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote: >> I'm not sure what to do about this: this part of the heap_update() >> logic has been like this forever, and I assume that if it were easy to >> refactor this away, somebody would have done it by now. > > How about changing collect_corrupt_items to acquire > AccessExclusiveLock for safely checking? Well, that would make it a lot less likely for pg_check_{visible,frozen} to detect the bug in heap_update(), but it wouldn't fix the bug in heap_update(). -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Wed, Jun 15, 2016 at 9:59 AM, Amit Kapila <amit.kapila16@gmail.com> wrote: >> That just kicks the can down the road. Then you have PD_ALL_VISIBLE >> clear but the VM bit is still set. > > I mean to say clear both as we are doing currently in code: > if (PageIsAllVisible(BufferGetPage(buffer))) > { > all_visible_cleared = true; > PageClearAllVisible(BufferGetPage(buffer)); > visibilitymap_clear(relation, BufferGetBlockNumber(buffer), > vmbuffer); > } Sure, but without emitting a WAL record, that's just broken. You could have the heap page get flushed to disk and the VM page not get flushed to disk, and then crash, and now you have the classic VM corruption scenario. >> And you still haven't WAL-logged >> anything. > > Yeah, I think WAL requirement is more difficult to meet and I think > releasing the lock on buffer before writing WAL could lead to flush of such > a buffer before WAL. > > I feel this is an existing-bug and should go to Older Bugs Section in open > items page. It does seem to be an existing bug, but the freeze map makes the problem more serious, I think. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Wed, Jun 15, 2016 at 08:56:52AM -0400, Robert Haas wrote: > On Wed, Jun 15, 2016 at 2:41 AM, Thomas Munro > <thomas.munro@enterprisedb.com> wrote: > > I spent some time chasing down the exact circumstances. I suspect > > that there may be an interlocking problem in heap_update. Using the > > line numbers from cae1c788 [1], I see the following interaction > > between the VACUUM, UPDATE and SELECT (pg_check_visible) backends, all > > in reference to the same block number: > > > > [VACUUM] sets all visible bit > > > > [UPDATE] heapam.c:3931 HeapTupleHeaderSetXmax(oldtup.t_data, xmax_old_tuple); > > [UPDATE] heapam.c:3938 LockBuffer(buffer, BUFFER_LOCK_UNLOCK); > > > > [SELECT] LockBuffer(buffer, BUFFER_LOCK_SHARE); > > [SELECT] observes VM_ALL_VISIBLE as true > > [SELECT] observes tuple in HEAPTUPLE_DELETE_IN_PROGRESS state > > [SELECT] barfs > > > > [UPDATE] heapam.c:4116 visibilitymap_clear(...) > > Yikes: heap_update() sets the tuple's XMAX, CMAX, infomask, infomask2, > and CTID without logging anything or clearing the all-visible flag and > then releases the lock on the heap page to go do some more work that > might even ERROR out. Only if that other work all goes OK do we > relock the page and perform the WAL-logged actions. > > That doesn't seem like a good idea even in existing releases, because > you've taken a tuple on an all-visible page and made it not > all-visible, and you've made a page modification that is not > necessarily atomic without logging it. This is particularly bad in > 9.6, because if that page is also all-frozen then XMAX will eventually > be pointing into space and VACUUM will never visit the page to > re-freeze it the way it would have done in earlier releases. However, > even in older releases, I think there's a remote possibility of data > corruption. Backend #1 makes these changes to the page, releases the > lock, and errors out. Backend #2 writes the page to the OS. DBA > takes a hot backup, tearing the page in the middle of XMAX. Oops. I agree the non-atomic, unlogged change is a problem. A related threat doesn't require a torn page:

AssignTransactionId() xid=123
heapam.c:3931 HeapTupleHeaderSetXmax(oldtup.t_data, 123);
some ERROR before heap_update() finishes
rollback; -- xid=123
some backend flushes the modified page
immediate shutdown
AssignTransactionId() xid=123
commit; -- xid=123

If nothing wrote an xlog record that witnesses xid 123, the cluster can reassign it after recovery. The failed update is now considered a successful update, and the row improperly becomes dead. That's important. I don't know whether the 9.6 all-frozen mechanism materially amplifies the consequences of this bug. The interaction with visibility map and freeze map is not all bad; indeed, it can reduce the risk of experiencing consequences from the non-atomic, unlogged change bug. If the row is all-visible when heap_update() starts, every transaction should continue to consider the row visible until heap_update() finishes successfully. If an ERROR interrupts heap_update(), visibility verdicts should be as though the heap_update() never happened. If one of the previously-described mechanisms would make an xmax visibility test give the wrong answer, an all-visible bit could mask the problem for awhile. Having said that, freeze map hurts in scenarios involving toast_insert_or_update() failures and no crash recovery. Instead of VACUUM cleaning up the aborted xmax, that xmax could persist long enough for its xid to be reused in a successful transaction.
When some other modification finally clears all-frozen and all-visible, the row improperly becomes dead. Both scenarios are fairly rare; I don't know which is more rare. [Disclaimer: I have not built test cases to verify those alleged failure mechanisms.] If we made this pre-9.6 bug a 9.6 open item, would anyone volunteer to own it? Then we wouldn't need to guess whether 9.6 will be safer with the freeze map or safer without the freeze map. Thanks, nm
On 2016-06-15 08:56:52 -0400, Robert Haas wrote: > Yikes: heap_update() sets the tuple's XMAX, CMAX, infomask, infomask2, > and CTID without logging anything or clearing the all-visible flag and > then releases the lock on the heap page to go do some more work that > might even ERROR out. Only if that other work all goes OK do we > relock the page and perform the WAL-logged actions. > > That doesn't seem like a good idea even in existing releases, because > you've taken a tuple on an all-visible page and made it not > all-visible, and you've made a page modification that is not > necessarily atomic without logging it. Right, that's broken. > I'm not sure what to do about this: this part of the heap_update() > logic has been like this forever, and I assume that if it were easy to > refactor this away, somebody would have done it by now. Well, I think generally nobody seriously looked at actually refactoring heap_update(), even though that'd be a good idea. But in this instance, the problem seems relatively fundamental: We need to lock the origin page, to do visibility checks, etc. Then we need to figure out the target page. Even disregarding toasting - which we could be doing earlier with some refactoring - we're going to have to release the page level lock, to lock them in ascending order. Otherwise we'll risk kinda likely deadlocks. We also certainly don't want to nest the lwlocks for the toast stuff, inside a content lock for the old tuple's page. So far the best idea I have - and it's really not a good one - is to invent a new hint-bit that tells concurrent updates to acquire a heavyweight tuple lock, while releasing the page-level lock. If that hint bit does not require any other modifications - and we don't need an xid in xmax for this use case - that'll avoid doing all the other `already_marked` stuff early, which should address the correctness issue. It's kinda invasive though, and probably has performance implications. Does anybody have a better idea? Regards, Andres
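A very rough sketch of that idea (purely hypothetical: the flag name and bit value are invented, assuming a spare t_infomask2 bit, and nothing like this exists today; heap_acquire_tuplock is the existing heapam.c helper):

/* hypothetical flag -- assumes a spare t_infomask2 bit can be found */
#define HEAP_UPDATE_PENDING     0x0800

/* in heap_update(), before releasing the content lock on the old page: */
oldtup.t_data->t_infomask2 |= HEAP_UPDATE_PENDING;
LockBuffer(buffer, BUFFER_LOCK_UNLOCK);

/* a concurrent updater that sees the bit falls back to the heavyweight
 * tuple lock instead of trusting the page contents: */
if (oldtup.t_data->t_infomask2 & HEAP_UPDATE_PENDING)
    heap_acquire_tuplock(relation, &(oldtup.t_self), LockTupleExclusive,
                         LockWaitBlock, &have_tuple_lock);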
On Mon, Jun 20, 2016 at 3:33 PM, Andres Freund <andres@anarazel.de> wrote: >> I'm not sure what to do about this: this part of the heap_update() >> logic has been like this forever, and I assume that if it were easy to >> refactor this away, somebody would have done it by now. > > Well, I think generally nobody seriously looked at actually refactoring > heap_update(), even though that'd be a good idea. But in this instance, > the problem seems relatively fundamental: > > We need to lock the origin page, to do visibility checks, etc. Then we > need to figure out the target page. Even disregarding toasting - which > we could be doing earlier with some refactoring - we're going to have to > release the page level lock, to lock them in ascending order. Otherwise > we'll risk kinda likely deadlocks. We also certainly don't want to nest > the lwlocks for the toast stuff, inside a content lock for the old > tupe's page. > > So far the best idea I have - and it's really not a good one - is to > invent a new hint-bit that tells concurrent updates to acquire a > heavyweight tuple lock, while releasing the page-level lock. If that > hint bit does not require any other modifications - and we don't need an > xid in xmax for this use case - that'll avoid doing all the other > `already_marked` stuff early, which should address the correctness > issue. It's kinda invasive though, and probably has performance > implications. > > Does anybody have a better idea? What exactly is the point of all of that already_marked stuff? I mean, suppose we just don't do any of that before we go off to do toast_insert_or_update and RelationGetBufferForTuple. Eventually, when we reacquire the page lock, we might find that somebody else has already updated the tuple, but couldn't that be handled by (approximately) looping back up to l2 just as we do in several other cases? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 2016-06-20 16:10:23 -0400, Robert Haas wrote: > What exactly is the point of all of that already_marked stuff? Preventing the old tuple from being locked/updated by another backend, while unlocking the buffer. > I > mean, suppose we just don't do any of that before we go off to do > toast_insert_or_update and RelationGetBufferForTuple. Eventually, > when we reacquire the page lock, we might find that somebody else has > already updated the tuple, but couldn't that be handled by > (approximately) looping back up to l2 just as we do in several other > cases? We'd potentially have to undo a fair amount more work: the toasted data would have to be deleted and such, just to retry. Which isn't going to be super easy, because all of it will be happening with the current cid (we can't just increase CommandCounterIncrement() for correctness reasons). Andres
On Mon, Jun 20, 2016 at 4:24 PM, Andres Freund <andres@anarazel.de> wrote: > On 2016-06-20 16:10:23 -0400, Robert Haas wrote: >> What exactly is the point of all of that already_marked stuff? > > Preventing the old tuple from being locked/updated by another backend, > while unlocking the buffer. > >> I >> mean, suppose we just don't do any of that before we go off to do >> toast_insert_or_update and RelationGetBufferForTuple. Eventually, >> when we reacquire the page lock, we might find that somebody else has >> already updated the tuple, but couldn't that be handled by >> (approximately) looping back up to l2 just as we do in several other >> cases? > > We'd potentially have to undo a fair amount more work: the toasted data > would have to be deleted and such, just to retry. Which isn't going to > super easy, because all of it will be happening with the current cid (we > can't just increase CommandCounterIncrement() for correctness reasons). Why would we have to delete the TOAST data? AFAIUI, the tuple points to the TOAST data, but not the other way around. So if we change our mind about where to put the tuple, I don't think that requires re-TOASTing. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
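For reference, the TOAST pointer stored in the heap tuple looks like this (varatt_external, as defined in postgres.h of this era); it carries only the forward reference into the TOAST table, with no back-pointer, which is why relocating the heap tuple itself would not require re-TOASTing:

typedef struct varatt_external
{
    int32       va_rawsize;     /* original data size (includes header) */
    int32       va_extsize;     /* external saved size (without header) */
    Oid         va_valueid;     /* unique ID of value within TOAST table */
    Oid         va_toastrelid;  /* RelationID of TOAST table containing it */
} varatt_external;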
On 2016-06-20 17:55:19 -0400, Robert Haas wrote: > On Mon, Jun 20, 2016 at 4:24 PM, Andres Freund <andres@anarazel.de> wrote: > > On 2016-06-20 16:10:23 -0400, Robert Haas wrote: > >> What exactly is the point of all of that already_marked stuff? > > > > Preventing the old tuple from being locked/updated by another backend, > > while unlocking the buffer. > > > >> I > >> mean, suppose we just don't do any of that before we go off to do > >> toast_insert_or_update and RelationGetBufferForTuple. Eventually, > >> when we reacquire the page lock, we might find that somebody else has > >> already updated the tuple, but couldn't that be handled by > >> (approximately) looping back up to l2 just as we do in several other > >> cases? > > > > We'd potentially have to undo a fair amount more work: the toasted data > > would have to be deleted and such, just to retry. Which isn't going to > > super easy, because all of it will be happening with the current cid (we > > can't just increase CommandCounterIncrement() for correctness reasons). > > Why would we have to delete the TOAST data? AFAIUI, the tuple points > to the TOAST data, but not the other way around. So if we change our > mind about where to put the tuple, I don't think that requires > re-TOASTing. Consider what happens if we, after restarting at l2, notice that we can't actually insert, but return in the !HeapTupleMayBeUpdated branch. If the caller doesn't error out - and there's certainly callers doing that - we'd "leak" a toasted datum. Unless the transaction aborts, the toasted datum would never be cleaned up, because there's no datum pointing to it, so no heap_delete will ever recurse into the toast datum (via toast_delete()). Andres
On Fri, Jun 17, 2016 at 3:36 PM, Noah Misch <noah@leadboat.com> wrote: > I agree the non-atomic, unlogged change is a problem. A related threat > doesn't require a torn page: > > AssignTransactionId() xid=123 > heapam.c:3931 HeapTupleHeaderSetXmax(oldtup.t_data, 123); > some ERROR before heap_update() finishes > rollback; -- xid=123 > some backend flushes the modified page > immediate shutdown > AssignTransactionId() xid=123 > commit; -- xid=123 > > If nothing wrote an xlog record that witnesses xid 123, the cluster can > reassign it after recovery. The failed update is now considered a successful > update, and the row improperly becomes dead. That's important. I wonder if that was originally supposed to be handled with the HEAP_XMAX_UNLOGGED flag which was removed in 11919160. A comment in the heap WAL logging commit f2bfe8a2 said that tqual routines would see the HEAP_XMAX_UNLOGGED flag in the event of a crash before logging (though I'm not sure if the tqual routines ever actually did that). -- Thomas Munro http://www.enterprisedb.com
On Tue, Jun 21, 2016 at 1:03 AM, Andres Freund <andres@anarazel.de> wrote:
>
> On 2016-06-15 08:56:52 -0400, Robert Haas wrote:
> > Yikes: heap_update() sets the tuple's XMAX, CMAX, infomask, infomask2,
> > and CTID without logging anything or clearing the all-visible flag and
> > then releases the lock on the heap page to go do some more work that
> > might even ERROR out. Only if that other work all goes OK do we
> > relock the page and perform the WAL-logged actions.
> >
> > That doesn't seem like a good idea even in existing releases, because
> > you've taken a tuple on an all-visible page and made it not
> > all-visible, and you've made a page modification that is not
> > necessarily atomic without logging it.
>
> Right, that's broken.
>
>
> > I'm not sure what to do about this: this part of the heap_update()
> > logic has been like this forever, and I assume that if it were easy to
> > refactor this away, somebody would have done it by now.
>
> Well, I think generally nobody seriously looked at actually refactoring
> heap_update(), even though that'd be a good idea. But in this instance,
> the problem seems relatively fundamental:
>
> We need to lock the origin page, to do visibility checks, etc. Then we
> need to figure out the target page. Even disregarding toasting - which
> we could be doing earlier with some refactoring - we're going to have to
> release the page level lock, to lock them in ascending order. Otherwise
> we'll risk kinda likely deadlocks.
>
Can we consider using some strategy to avoid deadlocks without releasing the lock on the old page? Consider if we could have a mechanism such that RelationGetBufferForTuple() will ensure that it always returns a new buffer whose targetblock is greater than the old block (on which we already hold a lock). I think the tricky part here is whether we can get anything like that from the FSM. Also, there could be cases where we need to extend the heap even though there were pages in the heap with space available, because we ignored them since their block number is smaller than the block number on which we have the lock.
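A sketch of what that could look like (entirely hypothetical: the FSM has no "free space above block N" entry point today, so fsm_search_above() below is invented purely for illustration):

/* hypothetical helper built on an invented FSM call */
static BlockNumber
get_target_block_above(Relation relation, BlockNumber oldBlock, Size needed)
{
    BlockNumber targetBlock;

    /* invented API: only consider blocks numbered above oldBlock */
    targetBlock = fsm_search_above(relation, needed, oldBlock);
    if (targetBlock == InvalidBlockNumber)
    {
        /*
         * No higher-numbered page has room: extend the relation, possibly
         * wasting free space in lower-numbered pages -- the space-usage
         * drawback noted above.
         */
        targetBlock = RelationGetNumberOfBlocks(relation);
    }
    return targetBlock;
}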
> We also certainly don't want to nest
> the lwlocks for the toast stuff, inside a content lock for the old
> tupe's page.
>
> So far the best idea I have - and it's really not a good one - is to
> invent a new hint-bit that tells concurrent updates to acquire a
> heavyweight tuple lock, while releasing the page-level lock. If that
> hint bit does not require any other modifications - and we don't need an
> xid in xmax for this use case - that'll avoid doing all the other
> `already_marked` stuff early, which should address the correctness
> issue.
>
Don't we need to clear such a flag in case of error? Also, don't we need to reset it later, when modifying the old page again before WAL-logging?
> It's kinda invasive though, and probably has performance
> implications.
>
Do you see a performance implication due to the requirement of a heavyweight tuple lock in more cases than now, or something else?
Some other ways could be:
Before releasing the lock on the buffer containing the old tuple, clear the VM and page-level visibility info and WAL-log it. I think this could impact performance depending on how frequently we need to perform this action.
Have a new flag like HEAP_XMAX_UNLOGGED (as existed when this logic was introduced in commit f2bfe8a24c46133f81e188653a127f939eb33c4a), set it in the old tuple header before releasing the lock on the buffer, and teach tqual.c to honor the flag. I think tqual.c should consider HEAP_XMAX_UNLOGGED as an indication of an aborted transaction unless that transaction is currently in progress. Also, I think we need to clear this flag before WAL logging in heap_update.
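A rough sketch of how tqual.c might honor such a flag (hypothetical: HEAP_XMAX_UNLOGGED no longer exists and the flag shown is notional; SetHintBits is the existing tqual.c helper):

/* in a tqual.c visibility routine, when examining xmax: */
if (tuple->t_infomask & HEAP_XMAX_UNLOGGED)     /* notional flag */
{
    if (TransactionIdIsInProgress(HeapTupleHeaderGetRawXmax(tuple)))
        return true;            /* deleter still running; tuple visible */

    /* the xmax was never WAL-logged: treat the deleter as aborted */
    SetHintBits(tuple, buffer, HEAP_XMAX_INVALID, InvalidTransactionId);
    return true;                /* tuple remains visible */
}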
On Tue, Jun 21, 2016 at 3:29 PM, Amit Kapila <amit.kapila16@gmail.com> wrote: > On Tue, Jun 21, 2016 at 1:03 AM, Andres Freund <andres@anarazel.de> wrote: >> Well, I think generally nobody seriously looked at actually refactoring >> heap_update(), even though that'd be a good idea. But in this instance, >> the problem seems relatively fundamental: >> >> We need to lock the origin page, to do visibility checks, etc. Then we >> need to figure out the target page. Even disregarding toasting - which >> we could be doing earlier with some refactoring - we're going to have to >> release the page level lock, to lock them in ascending order. Otherwise >> we'll risk kinda likely deadlocks. > > Can we consider to use some strategy to avoid deadlocks without releasing > the lock on old page? Consider if we could have a mechanism such that > RelationGetBufferForTuple() will ensure that it always returns a new buffer > which has targetblock greater than the old block (on which we already held a > lock). I think here tricky part is whether we can get anything like that > from FSM. Also, there could be cases where we need to extend the heap when > there were pages in heap with space available, but we have ignored them > because there block number is smaller than the block number on which we have > lock. Doesn't that mean that over time, given a workload that does only or mostly updates, your records tend to migrate further and further away from the start of the file, leaving a growing unusable space at the beginning, until you eventually need to CLUSTER/VACUUM FULL? I was wondering about speculatively asking for a free page with a lower block number than the origin page, if one is available, before locking the origin page. Then after locking the origin page, if it turns out you need a page but didn't get it earlier, asking for a free page with a higher block number than the origin page. -- Thomas Munro http://www.enterprisedb.com
On Tue, Jun 21, 2016 at 3:29 PM, Amit Kapila <amit.kapila16@gmail.com> wrote: > Some others ways could be: > > Before releasing the lock on buffer containing old tuple, clear the VM and > visibility info from page and WAL log it. I think this could impact > performance depending on how frequently we need to perform this action. > > Have a new flag like HEAP_XMAX_UNLOGGED (as it was there when this logic was > introduced in commit f2bfe8a24c46133f81e188653a127f939eb33c4a ) and set the > same in old tuple header before releasing lock on buffer and teach tqual.c > to honor the flag. I think tqual.c should consider HEAP_XMAX_UNLOGGED as > an indication of aborted transaction unless it is currently in-progress. > Also, I think we need to clear this flag before WAL logging in heap_update. I also noticed that and wondered whether it was a mistake to take that out. It appears to have been removed as part of the logic to clear away UNDO log support in 11919160, but it may have been an important part of the heap_update protocol. Though (as I mentioned nearby in a reply to Noah) I'm not sure if the tqual.c side which would ignore the unlogged xmax in the event of a badly timed crash was ever implemented. -- Thomas Munro http://www.enterprisedb.com
On 2016-06-21 08:59:13 +0530, Amit Kapila wrote: > Can we consider to use some strategy to avoid deadlocks without releasing > the lock on old page? Consider if we could have a mechanism such that > RelationGetBufferForTuple() will ensure that it always returns a new buffer > which has targetblock greater than the old block (on which we already held > a lock). I think here tricky part is whether we can get anything like that > from FSM. Also, there could be cases where we need to extend the heap when > there were pages in heap with space available, but we have ignored them > because there block number is smaller than the block number on which we > have lock. I can't see that being acceptable, from a space-usage POV. > > So far the best idea I have - and it's really not a good one - is to > invent a new hint-bit that tells concurrent updates to acquire a > heavyweight tuple lock, while releasing the page-level lock. If that > hint bit does not require any other modifications - and we don't need an > xid in xmax for this use case - that'll avoid doing all the other > `already_marked` stuff early, which should address the correctness > issue. > > Don't we need to clear such a flag in case of error? Also don't we need to > reset it later, like when modifying the old page later before WAL. If the flag just says "acquire a heavyweight lock", then there's no need for that. That's cheap enough to just do if it's erroneously set. At least I can't see any reason. > > It's kinda invasive though, and probably has performance > implications. > > Do you see performance implication due to requirement of heavywieht tuple > lock in more cases than now or something else? Because of that, yes. > Some others ways could be: > > Before releasing the lock on buffer containing old tuple, clear the VM and > visibility info from page and WAL log it. I think this could impact > performance depending on how frequently we need to perform this action. Doubling the number of xlog inserts in heap_update would certainly be measurable :(. My guess is that the heavyweight tuple lock approach will be less expensive. Andres
On Tue, Jun 21, 2016 at 9:08 AM, Thomas Munro <thomas.munro@enterprisedb.com> wrote:
>
> On Tue, Jun 21, 2016 at 3:29 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> > On Tue, Jun 21, 2016 at 1:03 AM, Andres Freund <andres@anarazel.de> wrote:
> >> Well, I think generally nobody seriously looked at actually refactoring
> >> heap_update(), even though that'd be a good idea. But in this instance,
> >> the problem seems relatively fundamental:
> >>
> >> We need to lock the origin page, to do visibility checks, etc. Then we
> >> need to figure out the target page. Even disregarding toasting - which
> >> we could be doing earlier with some refactoring - we're going to have to
> >> release the page level lock, to lock them in ascending order. Otherwise
> >> we'll risk kinda likely deadlocks.
> >
> > Can we consider to use some strategy to avoid deadlocks without releasing
> > the lock on old page? Consider if we could have a mechanism such that
> > RelationGetBufferForTuple() will ensure that it always returns a new buffer
> > which has targetblock greater than the old block (on which we already held a
> > lock). I think here tricky part is whether we can get anything like that
> > from FSM. Also, there could be cases where we need to extend the heap when
> > there were pages in heap with space available, but we have ignored them
> > because there block number is smaller than the block number on which we have
> > lock.
>
> Doesn't that mean that over time, given a workload that does only or
> mostly updates, your records tend to migrate further and further away
> from the start of the file, leaving a growing unusable space at the
> beginning, until you eventually need to CLUSTER/VACUUM FULL?
>
For update-mostly loads, the updated tuple should ideally fit in the same page as the old tuple in many cases, if fillfactor is properly configured. Why would the records always migrate further away? They should get the space freed by other updates in intermediate pages. I think there could be some impact space-wise, but the freed-up space will eventually be used.
> I was wondering about speculatively asking for a free page with a
> lower block number than the origin page, if one is available, before
> locking the origin page.
Do you want to lock it as well? In any case, I think adding that code without deciding whether the update can be accommodated in the current page could prove to be costly.
> Then after locking the origin page, if it
> turns out you need a page but didn't get it earlier, asking for a free
> page with a higher block number than the origin page.
>
Something like that might work out if it is feasible and people agree on pursuing such an approach.
On Tue, Jun 21, 2016 at 9:16 AM, Thomas Munro <thomas.munro@enterprisedb.com> wrote:
>
> On Tue, Jun 21, 2016 at 3:29 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> > Some others ways could be:
> >
> > Before releasing the lock on buffer containing old tuple, clear the VM and
> > visibility info from page and WAL log it. I think this could impact
> > performance depending on how frequently we need to perform this action.
> >
> > Have a new flag like HEAP_XMAX_UNLOGGED (as it was there when this logic was
> > introduced in commit f2bfe8a24c46133f81e188653a127f939eb33c4a ) and set the
> > same in old tuple header before releasing lock on buffer and teach tqual.c
> > to honor the flag. I think tqual.c should consider HEAP_XMAX_UNLOGGED as
> > an indication of aborted transaction unless it is currently in-progress.
> > Also, I think we need to clear this flag before WAL logging in heap_update.
>
> I also noticed that and wondered whether it was a mistake to take that
> out. It appears to have been removed as part of the logic to clear
> away UNDO log support in 11919160, but it may have been an important
> part of the heap_update protocol. Though (as I mentioned nearby in a
> reply to Noah) I'm not sure if the tqual.c side which would ignore
> the unlogged xmax in the event of a badly timed crash was ever
> implemented.
>
Right, my observation is similar to yours, and that's what I am suggesting as one alternative to fix this issue. I think making this approach work (even if it doesn't have any problems) might turn out to be tricky. However, the plus point of this approach seems to be that it shouldn't impact performance in most cases.
On Tue, Jun 21, 2016 at 9:21 AM, Andres Freund <andres@anarazel.de> wrote:
> On 2016-06-21 08:59:13 +0530, Amit Kapila wrote:
> > Can we consider to use some strategy to avoid deadlocks without releasing
> > the lock on old page? Consider if we could have a mechanism such that
> > RelationGetBufferForTuple() will ensure that it always returns a new buffer
> > which has targetblock greater than the old block (on which we already held
> > a lock). I think here tricky part is whether we can get anything like that
> > from FSM. Also, there could be cases where we need to extend the heap when
> > there were pages in heap with space available, but we have ignored them
> > because their block number is smaller than the block number on which we
> > have lock.
>
> I can't see that being acceptable, from a space-usage POV.
>
> > > So far the best idea I have - and it's really not a good one - is to
> > > invent a new hint-bit that tells concurrent updates to acquire a
> > > heavyweight tuple lock, while releasing the page-level lock. If that
> > > hint bit does not require any other modifications - and we don't need an
> > > xid in xmax for this use case - that'll avoid doing all the other
> > > `already_marked` stuff early, which should address the correctness
> > > issue.
> > >
> >
> > Don't we need to clear such a flag in case of error? Also don't we need to
> > reset it later, like when modifying the old page later before WAL.
>
> If the flag just says "acquire a heavyweight lock", then there's no need
> for that. That's cheap enough to just do if it's erroneously set. At
> least I can't see any reason.
>
I think it will just increase the chances of other backends acquiring a heavyweight lock.
> > > It's kinda invasive though, and probably has performance
> > > implications.
> > >
> >
> > Do you see a performance implication due to the requirement of heavyweight tuple
> > lock in more cases than now or something else?
>
> Because of that, yes.
>
>
> > Some others ways could be:
> >
> > Before releasing the lock on buffer containing old tuple, clear the VM and
> > visibility info from page and WAL log it. I think this could impact
> > performance depending on how frequently we need to perform this action.
>
> Doubling the number of xlog inserts in heap_update would certainly be
> measurable :(. My guess is that the heavyweight tuple lock approach will
> be less expensive.
>
Probably, but I think the heavyweight tuple lock is more invasive. Increasing the number of xlog inserts could surely impact performance, depending upon how frequently we need to do it. I think we might want to combine it with the idea of having RelationGetBufferForTuple() return a higher block number, such that if we don't find a higher block number from the FSM, we release the lock on the old page and try to get the locks on the old and new buffers as we do now. This will further reduce the chances of extra xlog insert calls and address the issue of space wastage.
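[To make the combined idea concrete, here is a minimal standalone C sketch of the block-ordering rule being proposed. Everything here is invented for illustration (the function name, the candidate array standing in for an FSM lookup); it is not PostgreSQL code.]

/*
 * Sketch: only accept a target block numbered above the block we
 * already hold locked, so the "lock lower block first" ordering is
 * never violated; otherwise signal the caller to fall back to the
 * release-and-relock path.
 */
#include <stdint.h>
#include <stdio.h>

typedef uint32_t BlockNumber;
#define InvalidBlockNumber ((BlockNumber) 0xFFFFFFFF)

static BlockNumber
fsm_search_above(const BlockNumber *candidates, int n, BlockNumber min_block)
{
    for (int i = 0; i < n; i++)
        if (candidates[i] > min_block)
            return candidates[i];   /* safe: old block < new block */
    return InvalidBlockNumber;      /* caller must release-and-relock */
}

int main(void)
{
    BlockNumber free_blocks[] = {3, 7, 42};  /* stand-in for FSM contents */
    BlockNumber locked_block = 10;           /* block we hold locked */
    BlockNumber target = fsm_search_above(free_blocks, 3, locked_block);

    if (target == InvalidBlockNumber)
        printf("no higher block with space: release old lock, then lock both in order\n");
    else
        printf("use block %u; lock ordering is preserved\n", (unsigned) target);
    return 0;
}

[As the thread notes, the cost of this scheme is space wastage: lower-numbered blocks with free space get skipped, which is why it is proposed only as a fallback-reducing optimization.]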
On Mon, Jun 20, 2016 at 5:59 PM, Andres Freund <andres@anarazel.de> wrote:
> On 2016-06-20 17:55:19 -0400, Robert Haas wrote:
>> On Mon, Jun 20, 2016 at 4:24 PM, Andres Freund <andres@anarazel.de> wrote:
>> > On 2016-06-20 16:10:23 -0400, Robert Haas wrote:
>> >> I mean, suppose we just don't do any of that before we go off to do
>> >> toast_insert_or_update and RelationGetBufferForTuple. Eventually,
>> >> when we reacquire the page lock, we might find that somebody else has
>> >> already updated the tuple, but couldn't that be handled by
>> >> (approximately) looping back up to l2 just as we do in several other
>> >> cases?
>> >
>> > We'd potentially have to undo a fair amount more work: the toasted data
>> > would have to be deleted and such, just to retry. Which isn't going to
>> > be super easy, because all of it will be happening with the current cid
>> > (we can't just increase CommandCounterIncrement() for correctness
>> > reasons).
>>
>> Why would we have to delete the TOAST data? AFAIUI, the tuple points
>> to the TOAST data, but not the other way around. So if we change our
>> mind about where to put the tuple, I don't think that requires
>> re-TOASTing.
>
> Consider what happens if we, after restarting at l2, notice that we
> can't actually insert, but return in the !HeapTupleMayBeUpdated
> branch. If the caller doesn't error out - and there are certainly
> callers doing that - we'd "leak" a toasted datum. Unless the
> transaction aborts, the toasted datum would never be cleaned up,
> because there's no datum pointing to it, so no heap_delete will ever
> recurse into the toast datum (via toast_delete()).

OK, I see what you mean. Still, that doesn't seem like such a terrible
cost. If you try to update a tuple, and it looks like you can update it,
but then after TOASTing you find that the status of the tuple has
changed such that you can't update it after all, then you might need to
go set xmax = MyTxid() on all of the TOAST tuples you created (whose
CTIDs we could save someplace, so that it's just a matter of finding
them by CTID to kill them). That's not likely to happen particularly
often, though, and when it does happen it's not insanely expensive. We
could also reduce the cost by letting the caller of heap_update() decide
whether to back out the work; if the caller intends to throw an error
anyway, then there's no point.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes:
> On Mon, Jun 20, 2016 at 5:59 PM, Andres Freund <andres@anarazel.de> wrote:
>> Consider what happens if we, after restarting at l2, notice that we
>> can't actually insert, but return in the !HeapTupleMayBeUpdated
>> branch.

> OK, I see what you mean. Still, that doesn't seem like such a
> terrible cost. If you try to update a tuple, and it looks like you
> can update it, but then after TOASTing you find that the status of the
> tuple has changed such that you can't update it after all, then you
> might need to go set xmax = MyTxid() on all of the TOAST tuples you
> created (whose CTIDs we could save someplace, so that it's just a
> matter of finding them by CTID to kill them).

... and if you get an error or crash partway through that, what happens?

regards, tom lane
On Mon, Jun 20, 2016 at 11:51 PM, Andres Freund <andres@anarazel.de> wrote:
>> > So far the best idea I have - and it's really not a good one - is to
>> > invent a new hint-bit that tells concurrent updates to acquire a
>> > heavyweight tuple lock, while releasing the page-level lock. If that
>> > hint bit does not require any other modifications - and we don't need an
>> > xid in xmax for this use case - that'll avoid doing all the other
>> > `already_marked` stuff early, which should address the correctness
>> > issue.
>>
>> Don't we need to clear such a flag in case of error? Also don't we need to
>> reset it later, like when modifying the old page later before WAL.
>
> If the flag just says "acquire a heavyweight lock", then there's no need
> for that. That's cheap enough to just do if it's erroneously set. At
> least I can't see any reason.

I don't quite understand the intended semantics of this proposed flag.
If we don't already have the tuple lock at that point, we can't go
acquire it before releasing the content lock, can we?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Tue, Jun 21, 2016 at 10:47 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> On Mon, Jun 20, 2016 at 5:59 PM, Andres Freund <andres@anarazel.de> wrote:
>>> Consider what happens if we, after restarting at l2, notice that we
>>> can't actually insert, but return in the !HeapTupleMayBeUpdated
>>> branch.
>
>> OK, I see what you mean. Still, that doesn't seem like such a
>> terrible cost. If you try to update a tuple, and it looks like you
>> can update it, but then after TOASTing you find that the status of the
>> tuple has changed such that you can't update it after all, then you
>> might need to go set xmax = MyTxid() on all of the TOAST tuples you
>> created (whose CTIDs we could save someplace, so that it's just a
>> matter of finding them by CTID to kill them).
>
> ... and if you get an error or crash partway through that, what happens?

Then the transaction is aborted anyway, and we haven't leaked anything
because VACUUM will clean it up.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 2016-06-21 10:50:36 -0400, Robert Haas wrote:
> On Mon, Jun 20, 2016 at 11:51 PM, Andres Freund <andres@anarazel.de> wrote:
> >> > So far the best idea I have - and it's really not a good one - is to
> >> > invent a new hint-bit that tells concurrent updates to acquire a
> >> > heavyweight tuple lock, while releasing the page-level lock. If that
> >> > hint bit does not require any other modifications - and we don't need an
> >> > xid in xmax for this use case - that'll avoid doing all the other
> >> > `already_marked` stuff early, which should address the correctness
> >> > issue.
> >>
> >> Don't we need to clear such a flag in case of error? Also don't we need to
> >> reset it later, like when modifying the old page later before WAL.
> >
> > If the flag just says "acquire a heavyweight lock", then there's no need
> > for that. That's cheap enough to just do if it's erroneously set. At
> > least I can't see any reason.
>
> I don't quite understand the intended semantics of this proposed flag.

Whenever the flag is set, we have to acquire the heavyweight tuple lock
before continuing. That guarantees nobody else can modify the tuple
while the lock is released, without requiring more than one hint bit to
be modified. That should fix the torn page issue, no?

> If we don't already have the tuple lock at that point, we can't go
> acquire it before releasing the content lock, can we?

Why not? Afaics the way that tuple locks are used, the nesting should
be fine.

Andres
On Tue, Jun 21, 2016 at 12:54 PM, Andres Freund <andres@anarazel.de> wrote:
>> I don't quite understand the intended semantics of this proposed flag.
>
> Whenever the flag is set, we have to acquire the heavyweight tuple lock
> before continuing. That guarantees nobody else can modify the tuple
> while the lock is released, without requiring more than one hint bit to
> be modified. That should fix the torn page issue, no?

Yeah, I guess that would work.

>> If we don't already have the tuple lock at that point, we can't go
>> acquire it before releasing the content lock, can we?
>
> Why not? Afaics the way that tuple locks are used, the nesting should
> be fine.

Well, the existing places where we acquire the tuple lock within
heap_update() are all careful to release the page lock first, so I'm
skeptical that doing it in the other order is safe. Certainly, if we've
got some code that grabs the page lock and then the tuple lock and
other code that grabs the tuple lock and then the page lock, that's a
deadlock waiting to happen. I'm also a bit dubious that LockAcquire is
safe to call in general with interrupts held.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
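[The deadlock concern here is the classic lock-ordering problem. A minimal generic illustration follows - plain pthreads, not PostgreSQL code - of the rule that every code path must take the two locks in the same global order; if one path took page-then-tuple while another took tuple-then-page, the two could each hold one lock while waiting for the other.]

/*
 * Both workers honor one agreed order: "tuple" lock first, then
 * "page" lock. Reversing the order in only one of them would create
 * a deadlock opportunity. Compile with: cc demo.c -pthread
 */
#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t tuple_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t page_lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg)
{
    pthread_mutex_lock(&tuple_lock);   /* heavyweight tuple lock analogue */
    pthread_mutex_lock(&page_lock);    /* buffer content lock analogue */
    printf("thread %ld: got both locks in the agreed order\n", (long) arg);
    pthread_mutex_unlock(&page_lock);
    pthread_mutex_unlock(&tuple_lock);
    return NULL;
}

int main(void)
{
    pthread_t a, b;
    pthread_create(&a, NULL, worker, (void *) 1L);
    pthread_create(&b, NULL, worker, (void *) 2L);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    return 0;
}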
On 2016-06-21 13:03:24 -0400, Robert Haas wrote:
> On Tue, Jun 21, 2016 at 12:54 PM, Andres Freund <andres@anarazel.de> wrote:
> >> I don't quite understand the intended semantics of this proposed flag.
> >
> > Whenever the flag is set, we have to acquire the heavyweight tuple lock
> > before continuing. That guarantees nobody else can modify the tuple
> > while the lock is released, without requiring more than one hint bit to
> > be modified. That should fix the torn page issue, no?
>
> Yeah, I guess that would work.
>
> >> If we don't already have the tuple lock at that point, we can't go
> >> acquire it before releasing the content lock, can we?
> >
> > Why not? Afaics the way that tuple locks are used, the nesting should
> > be fine.
>
> Well, the existing places where we acquire the tuple lock within
> heap_update() are all careful to release the page lock first, so I'm
> skeptical that doing it in the other order is safe. Certainly, if we've
> got some code that grabs the page lock and then the tuple lock and
> other code that grabs the tuple lock and then the page lock, that's a
> deadlock waiting to happen.

Just noticed this piece of code while looking into this:

	UnlockReleaseBuffer(buffer);
	if (have_tuple_lock)
		UnlockTupleTuplock(relation, &(tp.t_self), LockTupleExclusive);
	if (vmbuffer != InvalidBuffer)
		ReleaseBuffer(vmbuffer);
	return result;

seems weird to release the vmbuffer after the tuple lock...

> I'm also a bit dubious that LockAcquire is safe to call in general
> with interrupts held.

Looks like we could just acquire the tuple lock *before* doing the
toast_insert_or_update/RelationGetBufferForTuple, but after releasing
the buffer lock. That'd allow us to avoid doing the nested locking, and
should make the recovery just a goto l2;, ...

Andres
On Tue, Jun 21, 2016 at 1:49 PM, Andres Freund <andres@anarazel.de> wrote:
>> I'm also a bit dubious that LockAcquire is safe to call in general
>> with interrupts held.
>
> Looks like we could just acquire the tuple lock *before* doing the
> toast_insert_or_update/RelationGetBufferForTuple, but after releasing
> the buffer lock. That'd allow us to avoid doing the nested locking, and
> should make the recovery just a goto l2;, ...

Why isn't that racey? Somebody else can grab the tuple lock after we
release the buffer content lock and before we acquire the tuple lock.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 2016-06-21 15:38:25 -0400, Robert Haas wrote:
> On Tue, Jun 21, 2016 at 1:49 PM, Andres Freund <andres@anarazel.de> wrote:
> >> I'm also a bit dubious that LockAcquire is safe to call in general
> >> with interrupts held.
> >
> > Looks like we could just acquire the tuple lock *before* doing the
> > toast_insert_or_update/RelationGetBufferForTuple, but after releasing
> > the buffer lock. That'd allow us to avoid doing the nested locking, and
> > should make the recovery just a goto l2;, ...
>
> Why isn't that racey? Somebody else can grab the tuple lock after we
> release the buffer content lock and before we acquire the tuple lock.

Sure, but by the time the tuple lock is released, they'd have updated
xmax. So once we've acquired that we can just do

	if (xmax_infomask_changed(oldtup.t_data->t_infomask, infomask) ||
		!TransactionIdEquals(HeapTupleHeaderGetRawXmax(oldtup.t_data),
							 xwait))
		goto l2;

which is fine, because we've not yet done the toasting.

I'm not sure whether this approach is better than deleting potentially
toasted data though. It's probably faster, but will likely touch more
places in the code, and eat up an infomask bit (infomask & HEAP_MOVED
== HEAP_MOVED in my prototype).

Andres
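[For readers unfamiliar with the trick mentioned at the end: HEAP_MOVED_OFF and HEAP_MOVED_IN are mutually exclusive on any legitimate tuple, so testing for *both* bits at once yields a "free" flag without consuming a fresh infomask bit. A minimal standalone sketch follows; the two bit values match src/include/access/htup_details.h, while the predicate name is hypothetical:]

#include <stdint.h>
#include <stdio.h>

#define HEAP_MOVED_OFF	0x4000	/* from htup_details.h */
#define HEAP_MOVED_IN	0x8000	/* from htup_details.h */
#define HEAP_MOVED		(HEAP_MOVED_OFF | HEAP_MOVED_IN)

/* Hypothetical: true only when both bits are set simultaneously. */
static int needs_heavyweight_lock(uint16_t infomask)
{
    return (infomask & HEAP_MOVED) == HEAP_MOVED;
}

int main(void)
{
    printf("%d\n", needs_heavyweight_lock(HEAP_MOVED_OFF)); /* 0 */
    printf("%d\n", needs_heavyweight_lock(HEAP_MOVED_IN));  /* 0 */
    printf("%d\n", needs_heavyweight_lock(HEAP_MOVED));     /* 1 */
    return 0;
}

[The downside, raised in the next message, is that it permanently burns the bit pair, which was reserved for old VACUUM FULL-era tuple states.]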
On Tue, Jun 21, 2016 at 3:46 PM, Andres Freund <andres@anarazel.de> wrote:
> On 2016-06-21 15:38:25 -0400, Robert Haas wrote:
>> On Tue, Jun 21, 2016 at 1:49 PM, Andres Freund <andres@anarazel.de> wrote:
>> >> I'm also a bit dubious that LockAcquire is safe to call in general
>> >> with interrupts held.
>> >
>> > Looks like we could just acquire the tuple lock *before* doing the
>> > toast_insert_or_update/RelationGetBufferForTuple, but after releasing
>> > the buffer lock. That'd allow us to avoid doing the nested locking, and
>> > should make the recovery just a goto l2;, ...
>>
>> Why isn't that racey? Somebody else can grab the tuple lock after we
>> release the buffer content lock and before we acquire the tuple lock.
>
> Sure, but by the time the tuple lock is released, they'd have updated
> xmax. So once we've acquired that we can just do
>
>	if (xmax_infomask_changed(oldtup.t_data->t_infomask, infomask) ||
>		!TransactionIdEquals(HeapTupleHeaderGetRawXmax(oldtup.t_data),
>							 xwait))
>		goto l2;
>
> which is fine, because we've not yet done the toasting.

I see.

> I'm not sure whether this approach is better than deleting potentially
> toasted data though. It's probably faster, but will likely touch more
> places in the code, and eat up an infomask bit (infomask & HEAP_MOVED
> == HEAP_MOVED in my prototype).

Ugh. That's not very desirable at all.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 2016-06-21 16:32:03 -0400, Robert Haas wrote:
> On Tue, Jun 21, 2016 at 3:46 PM, Andres Freund <andres@anarazel.de> wrote:
> > I'm not sure whether this approach is better than deleting potentially
> > toasted data though. It's probably faster, but will likely touch more
> > places in the code, and eat up an infomask bit (infomask & HEAP_MOVED
> > == HEAP_MOVED in my prototype).
>
> Ugh. That's not very desirable at all.

I'm looking into three approaches right now:

1) Flag approach from above
2) Undo toasting on concurrent activity, retry
3) Use WAL logging for the already_marked = true case.

1) primarily suffers from a significant amount of complexity. I still
have a bug in there that sometimes triggers "attempted to update
invisible tuple" ERRORs. Otherwise it seems to perform decently
performance-wise - even on workloads with many backends hitting the
same tuple, the retry rate is low.

2) Seems to work too, but due to the amount of time the tuple is not
locked, the retry rate can be really high. As we perform a significant
amount of work (toast insertion & index manipulation or extending a
file) while the tuple is not locked, it's quite likely that another
session tries to modify the tuple in between. I think it's possible to
essentially livelock.

3) This approach so far seems the best. It's possible to reuse the
xl_heap_lock record (in an afaics backwards compatible manner), and in
most cases the overhead isn't that large. It's of course annoying to
emit more WAL, but it's not that big an overhead compared to extending
a file, or to toasting. It's also by far the simplest fix.

Comments?
Andres Freund wrote:

> I'm looking into three approaches right now:
>
> 3) Use WAL logging for the already_marked = true case.

> 3) This approach so far seems the best. It's possible to reuse the
> xl_heap_lock record (in an afaics backwards compatible manner), and in
> most cases the overhead isn't that large. It's of course annoying to
> emit more WAL, but it's not that big an overhead compared to extending
> a file, or to toasting. It's also by far the simplest fix.

I suppose it's fine if we crash midway from emitting this wal record and
the actual heap_update one, since the xmax will appear to come from an
aborted xid, right?

I agree that the overhead is probably negligible, considering that this
only happens when toast is invoked. It's probably not as great when the
new tuple goes to another page, though.

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 2016-06-23 18:59:57 -0400, Alvaro Herrera wrote:
> Andres Freund wrote:
>
> > I'm looking into three approaches right now:
> >
> > 3) Use WAL logging for the already_marked = true case.
>
> > 3) This approach so far seems the best. It's possible to reuse the
> > xl_heap_lock record (in an afaics backwards compatible manner), and in
> > most cases the overhead isn't that large. It's of course annoying to
> > emit more WAL, but it's not that big an overhead compared to extending
> > a file, or to toasting. It's also by far the simplest fix.
>
> I suppose it's fine if we crash midway from emitting this wal record and
> the actual heap_update one, since the xmax will appear to come from an
> aborted xid, right?

Yea, that should be fine.

> I agree that the overhead is probably negligible, considering that this
> only happens when toast is invoked. It's probably not as great when the
> new tuple goes to another page, though.

I think it has to happen in both cases unfortunately. We could try to
add some optimizations (e.g. only release lock & WAL log if the target
page, via fsm, is before the current one), but I don't really want to go
there in the back branches.

Andres
On Fri, Jun 24, 2016 at 4:33 AM, Andres Freund <andres@anarazel.de> wrote:
> On 2016-06-23 18:59:57 -0400, Alvaro Herrera wrote:
>> Andres Freund wrote:
>>
>> > I'm looking into three approaches right now:
>> >
>> > 3) Use WAL logging for the already_marked = true case.
>>
>> > 3) This approach so far seems the best. It's possible to reuse the
>> > xl_heap_lock record (in an afaics backwards compatible manner), and in
>> > most cases the overhead isn't that large. It's of course annoying to
>> > emit more WAL, but it's not that big an overhead compared to extending
>> > a file, or to toasting. It's also by far the simplest fix.
>>

+1 for proceeding with Approach-3.

>> I suppose it's fine if we crash midway from emitting this wal record and
>> the actual heap_update one, since the xmax will appear to come from an
>> aborted xid, right?
>
> Yea, that should be fine.
>
>> I agree that the overhead is probably negligible, considering that this
>> only happens when toast is invoked. It's probably not as great when the
>> new tuple goes to another page, though.
>
> I think it has to happen in both cases unfortunately. We could try to
> add some optimizations (e.g. only release lock & WAL log if the target
> page, via fsm, is before the current one), but I don't really want to go
> there in the back branches.
>

You are right; I think we can try such an optimization in HEAD, and that
too only if we see a performance hit from adding this new WAL in
heap_update.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On Tue, Jun 21, 2016 at 10:59:25AM +1200, Thomas Munro wrote:
> On Fri, Jun 17, 2016 at 3:36 PM, Noah Misch <noah@leadboat.com> wrote:
> > I agree the non-atomic, unlogged change is a problem. A related threat
> > doesn't require a torn page:
> >
> >   AssignTransactionId() xid=123
> >   heapam.c:3931 HeapTupleHeaderSetXmax(oldtup.t_data, 123);
> >   some ERROR before heap_update() finishes
> >   rollback;  -- xid=123
> >   some backend flushes the modified page
> >   immediate shutdown
> >   AssignTransactionId() xid=123
> >   commit;  -- xid=123
> >
> > If nothing wrote an xlog record that witnesses xid 123, the cluster can
> > reassign it after recovery. The failed update is now considered a
> > successful update, and the row improperly becomes dead. That's important.
>
> I wonder if that was originally supposed to be handled with the
> HEAP_XMAX_UNLOGGED flag which was removed in 11919160. A comment in
> the heap WAL logging commit f2bfe8a2 said that tqual routines would
> see the HEAP_XMAX_UNLOGGED flag in the event of a crash before logging
> (though I'm not sure if the tqual routines ever actually did that).

HEAP_XMAX_UNLOGGED does appear to have originated in contemplation of
this same hazard. Looking at the three commits in "git log -S
HEAP_XMAX_UNLOGGED" (f2bfe8a b58c041 1191916), nothing ever completed
the implementation by testing for that flag.
On Tue, Jun 21, 2016 at 6:59 AM, Andres Freund <andres@anarazel.de> wrote:
> On 2016-06-20 17:55:19 -0400, Robert Haas wrote:
>> On Mon, Jun 20, 2016 at 4:24 PM, Andres Freund <andres@anarazel.de> wrote:
>> > On 2016-06-20 16:10:23 -0400, Robert Haas wrote:
>> >> What exactly is the point of all of that already_marked stuff?
>> >
>> > Preventing the old tuple from being locked/updated by another backend,
>> > while unlocking the buffer.
>> >
>> >> I mean, suppose we just don't do any of that before we go off to do
>> >> toast_insert_or_update and RelationGetBufferForTuple. Eventually,
>> >> when we reacquire the page lock, we might find that somebody else has
>> >> already updated the tuple, but couldn't that be handled by
>> >> (approximately) looping back up to l2 just as we do in several other
>> >> cases?
>> >
>> > We'd potentially have to undo a fair amount more work: the toasted data
>> > would have to be deleted and such, just to retry. Which isn't going to
>> > be super easy, because all of it will be happening with the current cid
>> > (we can't just increase CommandCounterIncrement() for correctness
>> > reasons).
>>
>> Why would we have to delete the TOAST data? AFAIUI, the tuple points
>> to the TOAST data, but not the other way around. So if we change our
>> mind about where to put the tuple, I don't think that requires
>> re-TOASTing.
>
> Consider what happens if we, after restarting at l2, notice that we
> can't actually insert, but return in the !HeapTupleMayBeUpdated
> branch. If the caller doesn't error out - and there are certainly
> callers doing that - we'd "leak" a toasted datum.

Sorry to interrupt, but I have a question about this case.
Is there a case where we go back to l2 after we have created the
toasted datum (i.e. called toast_insert_or_update)?
IIUC, after we store the toast datum we just insert the heap tuple and
log WAL (or error out for some reason).

Regards,

--
Masahiko Sawada
On Tue, Jun 28, 2016 at 8:06 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> On Tue, Jun 21, 2016 at 6:59 AM, Andres Freund <andres@anarazel.de> wrote:
>> On 2016-06-20 17:55:19 -0400, Robert Haas wrote:
>>> On Mon, Jun 20, 2016 at 4:24 PM, Andres Freund <andres@anarazel.de> wrote:
>>> > On 2016-06-20 16:10:23 -0400, Robert Haas wrote:
>>> >> What exactly is the point of all of that already_marked stuff?
>>> >
>>> > Preventing the old tuple from being locked/updated by another backend,
>>> > while unlocking the buffer.
>>> >
>>> >> I mean, suppose we just don't do any of that before we go off to do
>>> >> toast_insert_or_update and RelationGetBufferForTuple. Eventually,
>>> >> when we reacquire the page lock, we might find that somebody else has
>>> >> already updated the tuple, but couldn't that be handled by
>>> >> (approximately) looping back up to l2 just as we do in several other
>>> >> cases?
>>> >
>>> > We'd potentially have to undo a fair amount more work: the toasted data
>>> > would have to be deleted and such, just to retry. Which isn't going to
>>> > be super easy, because all of it will be happening with the current cid
>>> > (we can't just increase CommandCounterIncrement() for correctness
>>> > reasons).
>>>
>>> Why would we have to delete the TOAST data? AFAIUI, the tuple points
>>> to the TOAST data, but not the other way around. So if we change our
>>> mind about where to put the tuple, I don't think that requires
>>> re-TOASTing.
>>
>> Consider what happens if we, after restarting at l2, notice that we
>> can't actually insert, but return in the !HeapTupleMayBeUpdated
>> branch. If the caller doesn't error out - and there are certainly
>> callers doing that - we'd "leak" a toasted datum.
>
> Sorry to interrupt, but I have a question about this case.
> Is there a case where we go back to l2 after we have created the
> toasted datum (i.e. called toast_insert_or_update)?
> IIUC, after we store the toast datum we just insert the heap tuple and
> log WAL (or error out for some reason).
>

I understood now; sorry for the noise.

Regards,

--
Masahiko Sawada
On Fri, Jun 24, 2016 at 11:04 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Fri, Jun 24, 2016 at 4:33 AM, Andres Freund <andres@anarazel.de> wrote:
>> On 2016-06-23 18:59:57 -0400, Alvaro Herrera wrote:
>>> Andres Freund wrote:
>>>
>>> > I'm looking into three approaches right now:
>>> >
>>> > 3) Use WAL logging for the already_marked = true case.
>>>
>>> > 3) This approach so far seems the best. It's possible to reuse the
>>> > xl_heap_lock record (in an afaics backwards compatible manner), and in
>>> > most cases the overhead isn't that large. It's of course annoying to
>>> > emit more WAL, but it's not that big an overhead compared to extending
>>> > a file, or to toasting. It's also by far the simplest fix.
>>>
>
> +1 for proceeding with Approach-3.
>
>> I think it has to happen in both cases unfortunately. We could try to
>> add some optimizations (e.g. only release lock & WAL log if the target
>> page, via fsm, is before the current one), but I don't really want to go
>> there in the back branches.
>>
>
> You are right; I think we can try such an optimization in HEAD, and that
> too only if we see a performance hit from adding this new WAL in
> heap_update.
>

+1 for the #3 approach; draft patch attached.
I think the attached patch would fix this problem, but please let me
know if it is not what you were thinking of.

Regards,

--
Masahiko Sawada
On Wed, Jun 29, 2016 at 11:14 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> On Fri, Jun 24, 2016 at 11:04 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> On Fri, Jun 24, 2016 at 4:33 AM, Andres Freund <andres@anarazel.de> wrote:
>>> On 2016-06-23 18:59:57 -0400, Alvaro Herrera wrote:
>>>> Andres Freund wrote:
>>>>
>>>> > I'm looking into three approaches right now:
>>>> >
>>>> > 3) Use WAL logging for the already_marked = true case.
>>>>
>>>> > 3) This approach so far seems the best. It's possible to reuse the
>>>> > xl_heap_lock record (in an afaics backwards compatible manner), and in
>>>> > most cases the overhead isn't that large. It's of course annoying to
>>>> > emit more WAL, but it's not that big an overhead compared to extending
>>>> > a file, or to toasting. It's also by far the simplest fix.
>>>>
>>
>> You are right; I think we can try such an optimization in HEAD, and that
>> too only if we see a performance hit from adding this new WAL in
>> heap_update.
>>
>
> +1 for the #3 approach; draft patch attached.
> I think the attached patch would fix this problem, but please let me
> know if it is not what you were thinking of.

Review comments:

+	if (RelationNeedsWAL(relation))
+	{
+		xl_heap_lock xlrec;
+		XLogRecPtr	recptr;
+
..
+		xlrec.offnum = ItemPointerGetOffsetNumber(&oldtup.t_self);
+		xlrec.locking_xid = xid;
+		xlrec.infobits_set = compute_infobits(oldtup.t_data->t_infomask,
+											  oldtup.t_data->t_infomask2);
+		XLogRegisterData((char *) &xlrec, SizeOfHeapLock);
+		recptr = XLogInsert(RM_HEAP_ID, XLOG_HEAP_LOCK);
+		PageSetLSN(page, recptr);
+	}

There is nothing in this record which records the information about the
visibility clear flag. How will you ensure that the flag is cleared
after a crash?

Have you considered logging the cid using log_heap_new_cid() for
logical decoding?

It seems to me that the value of locking_xid should be xmax_old_tuple;
why have you chosen xid?

+	/* Celar PD_ALL_VISIBLE flags */
+	if (PageIsAllVisible(BufferGetPage(buffer)))
+	{
+		all_visible_cleared = true;
+		PageClearAllVisible(BufferGetPage(buffer));
+		visibilitymap_clear(relation, BufferGetBlockNumber(buffer),
+							vmbuffer);
+	}
+
+	MarkBufferDirty(buffer);
+
	/* Clear obsolete visibility flags ... */
	oldtup.t_data->t_infomask &= ~(HEAP_XMAX_BITS | HEAP_MOVED);

I think it is better to first update the tuple-related info and then
clear the PD_ALL_VISIBLE flags (for the order, refer to how we have done
it in heap_update in the code below where you are trying to add new
code).

Couple of typos -

/relasing/releasing
/Celar/Clear

I think in this approach, it is important to measure the performance of
update; maybe you can use the simple-update option of pgbench for
various workloads. Try it with different fill factors (-F fillfactor
option in pgbench).

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On 2016-06-29 19:04:31 +0530, Amit Kapila wrote:
> There is nothing in this record which records the information about the
> visibility clear flag.

I think we can actually defer the clearing to the lock release? A tuple
being locked doesn't require the vm being cleared.

> I think in this approach, it is important to measure the performance of
> update; maybe you can use the simple-update option of pgbench for
> various workloads. Try it with different fill factors (-F fillfactor
> option in pgbench).

Probably not sufficient; it also needs toast activity, to show the
really bad case of many lock releases.
On Wed, Jun 29, 2016 at 10:30 PM, Andres Freund <andres@anarazel.de> wrote:
> On 2016-06-29 19:04:31 +0530, Amit Kapila wrote:
>> There is nothing in this record which records the information about the
>> visibility clear flag.
>
> I think we can actually defer the clearing to the lock release?

How about the case where, after we release the lock on the page, the
heap page gets flushed, but not the vm, and then the server crashes?
After recovery, vacuum will never consider such a page for freezing, as
the vm bit still says all_frozen.

Another possibility could be that the WAL for xl_heap_lock got flushed,
but not the heap page, before the crash; after recovery, replay will set
the tuple with the appropriate infomask and other flags, but the heap
page will still be marked as ALL_VISIBLE. I think that can lead to the
situation which Thomas Munro has reported upthread.

In all other cases in heapam.c, after clearing the vm and the
corresponding flag in the heap page, we record the same in WAL. Why make
this a different case, and how is it safe to do it here but not at the
other places?

> A tuple
> being locked doesn't require the vm being cleared.
>
>> I think in this approach, it is important to measure the performance of
>> update; maybe you can use the simple-update option of pgbench for
>> various workloads. Try it with different fill factors (-F fillfactor
>> option in pgbench).
>
> Probably not sufficient; it also needs toast activity, to show the
> really bad case of many lock releases.

Okay, makes sense.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On 2016-06-30 08:59:16 +0530, Amit Kapila wrote:
> On Wed, Jun 29, 2016 at 10:30 PM, Andres Freund <andres@anarazel.de> wrote:
> > On 2016-06-29 19:04:31 +0530, Amit Kapila wrote:
> >> There is nothing in this record which records the information about the
> >> visibility clear flag.
> >
> > I think we can actually defer the clearing to the lock release?
>
> How about the case where, after we release the lock on the page, the
> heap page gets flushed, but not the vm, and then the server crashes?

In the released branches there's no need to clear all-visible at that
point. Note how heap_lock_tuple doesn't clear it at all. So we should be
fine there, and that's the part where reusing an existing record is
important (for compatibility).

But your question made me realize that we desperately *do* need to
clear the frozen bit in heap_lock_tuple in 9.6...

Greetings,

Andres Freund
On Thu, Jun 30, 2016 at 9:13 AM, Andres Freund <andres@anarazel.de> wrote:
> On 2016-06-30 08:59:16 +0530, Amit Kapila wrote:
>> On Wed, Jun 29, 2016 at 10:30 PM, Andres Freund <andres@anarazel.de> wrote:
>> > On 2016-06-29 19:04:31 +0530, Amit Kapila wrote:
>> >> There is nothing in this record which records the information about the
>> >> visibility clear flag.
>> >
>> > I think we can actually defer the clearing to the lock release?
>>
>> How about the case where, after we release the lock on the page, the
>> heap page gets flushed, but not the vm, and then the server crashes?
>
> In the released branches there's no need to clear all-visible at that
> point. Note how heap_lock_tuple doesn't clear it at all. So we should be
> fine there, and that's the part where reusing an existing record is
> important (for compatibility).
>

For back branches, I agree that heap_lock_tuple is sufficient, but in
that case we should not clear the vm or page bit at all, as is done in
the proposed patch.

> But your question made me realize that we desperately *do* need to
> clear the frozen bit in heap_lock_tuple in 9.6...
>

Right.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On Thu, Jun 30, 2016 at 3:12 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Thu, Jun 30, 2016 at 9:13 AM, Andres Freund <andres@anarazel.de> wrote:
>> On 2016-06-30 08:59:16 +0530, Amit Kapila wrote:
>>> On Wed, Jun 29, 2016 at 10:30 PM, Andres Freund <andres@anarazel.de> wrote:
>>> > On 2016-06-29 19:04:31 +0530, Amit Kapila wrote:
>>> >> There is nothing in this record which records the information about the
>>> >> visibility clear flag.
>>> >
>>> > I think we can actually defer the clearing to the lock release?
>>>
>>> How about the case where, after we release the lock on the page, the
>>> heap page gets flushed, but not the vm, and then the server crashes?
>>
>> In the released branches there's no need to clear all-visible at that
>> point. Note how heap_lock_tuple doesn't clear it at all. So we should be
>> fine there, and that's the part where reusing an existing record is
>> important (for compatibility).
>>
>
> For back branches, I agree that heap_lock_tuple is sufficient,

Even if we use heap_lock_tuple, if the server crashed after flushing the
heap but not the vm, then after crash recovery the page is still marked
all-visible in the vm.
This could happen even on released branches, and could make an
IndexOnlyScan return a wrong result?

Regards,

--
Masahiko Sawada
On Thu, Jun 30, 2016 at 8:10 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> On Thu, Jun 30, 2016 at 3:12 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> On Thu, Jun 30, 2016 at 9:13 AM, Andres Freund <andres@anarazel.de> wrote:
>>> On 2016-06-30 08:59:16 +0530, Amit Kapila wrote:
>>>> How about the case where, after we release the lock on the page, the
>>>> heap page gets flushed, but not the vm, and then the server crashes?
>>>
>>> In the released branches there's no need to clear all-visible at that
>>> point. Note how heap_lock_tuple doesn't clear it at all. So we should be
>>> fine there, and that's the part where reusing an existing record is
>>> important (for compatibility).
>>>
>>
>> For back branches, I agree that heap_lock_tuple is sufficient,
>
> Even if we use heap_lock_tuple, if the server crashed after flushing the
> heap but not the vm, then after crash recovery the page is still marked
> all-visible in the vm.

So, in this case both the vm and the page will be marked as all_visible.

> This could happen even on released branches, and could make an
> IndexOnlyScan return a wrong result?
>

Why do you think an IndexOnlyScan will return a wrong result? If the
server crashes in the way you described, the transaction that made the
modifications will anyway be considered aborted, so the result of the
IndexOnlyScan should not be wrong.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On Fri, Jul 1, 2016 at 11:12 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Thu, Jun 30, 2016 at 8:10 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>> Even if we use heap_lock_tuple, if the server crashed after flushing the
>> heap but not the vm, then after crash recovery the page is still marked
>> all-visible in the vm.
>
> So, in this case both the vm and the page will be marked as all_visible.
>
>> This could happen even on released branches, and could make an
>> IndexOnlyScan return a wrong result?
>>
>
> Why do you think an IndexOnlyScan will return a wrong result? If the
> server crashes in the way you described, the transaction that made the
> modifications will anyway be considered aborted, so the result of the
> IndexOnlyScan should not be wrong.
>

Ah, you're right, I misunderstood.

Attached is an updated patch incorporating your comments.
I've changed it so that heap_xlog_lock clears the vm flags if the page
is marked all-frozen.

Regards,

--
Masahiko Sawada
On Fri, Jul 1, 2016 at 10:22 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> Ah, you're right, I misunderstood.
>
> Attached is an updated patch incorporating your comments.
> I've changed it so that heap_xlog_lock clears the vm flags if the page
> is marked all-frozen.

I believe that this should be separated into two patches, since there
are two issues here:

1. Locking a tuple doesn't clear the all-frozen bit, but needs to do so.
2. heap_update releases the buffer content lock without logging the
   changes it has made.

With respect to #1, there is no need to clear the all-visible bit, only
the all-frozen bit. However, that's a bit tricky given that we removed
PD_ALL_FROZEN. Should we think about putting that back again? Should we
just clear all-visible and call it good enough? The only cost of that is
that vacuum will come along and mark the page all-visible again instead
of skipping it, but that's probably not an enormous expense in most
cases.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 2016-07-01 15:18:39 -0400, Robert Haas wrote:
> On Fri, Jul 1, 2016 at 10:22 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > Ah, you're right, I misunderstood.
> >
> > Attached is an updated patch incorporating your comments.
> > I've changed it so that heap_xlog_lock clears the vm flags if the page
> > is marked all-frozen.
>
> I believe that this should be separated into two patches, since there
> are two issues here:
>
> 1. Locking a tuple doesn't clear the all-frozen bit, but needs to do so.
> 2. heap_update releases the buffer content lock without logging the
>    changes it has made.
>
> With respect to #1, there is no need to clear the all-visible bit, only
> the all-frozen bit. However, that's a bit tricky given that we removed
> PD_ALL_FROZEN. Should we think about putting that back again?

I think it's fine to just do the vm lookup.

> Should we just clear all-visible and call it good enough?

Given that we need to do that in heap_lock_tuple, which entirely
preserves all-visible (but shouldn't preserve all-frozen), ISTM we
better find something that doesn't invalidate all-visible.

> The only
> cost of that is that vacuum will come along and mark the page
> all-visible again instead of skipping it, but that's probably not an
> enormous expense in most cases.

I think the main cost is not having the page marked as all-visible for
index-only purposes. If it's an insert-mostly table, it can be a long
while till vacuum comes around.

Andres
On 7/1/16 2:23 PM, Andres Freund wrote:
>> > The only
>> > cost of that is that vacuum will come along and mark the page
>> > all-visible again instead of skipping it, but that's probably not an
>> > enormous expense in most cases.
> I think the main cost is not having the page marked as all-visible for
> index-only purposes. If it's an insert-mostly table, it can be a long
> while till vacuum comes around.

ISTM that's something that should be addressed anyway (and separately), no?

--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com
855-TREBLE2 (855-873-2532)   mobile: 512-569-9461
On 2016-07-01 15:42:22 -0500, Jim Nasby wrote:
> On 7/1/16 2:23 PM, Andres Freund wrote:
> > I think the main cost is not having the page marked as all-visible for
> > index-only purposes. If it's an insert-mostly table, it can be a long
> > while till vacuum comes around.
>
> ISTM that's something that should be addressed anyway (and separately), no?

Huh? That's the current behaviour in heap_lock_tuple.
On 7/1/16 3:43 PM, Andres Freund wrote:
> On 2016-07-01 15:42:22 -0500, Jim Nasby wrote:
>> On 7/1/16 2:23 PM, Andres Freund wrote:
>>> I think the main cost is not having the page marked as all-visible for
>>> index-only purposes. If it's an insert-mostly table, it can be a long
>>> while till vacuum comes around.
>>
>> ISTM that's something that should be addressed anyway (and separately), no?
>
> Huh? That's the current behaviour in heap_lock_tuple.

Oh, I was referring to autovac not being aggressive enough on
insert-mostly tables. Certainly, if there's a reasonable way to avoid
invalidating the VM when locking a tuple, that'd be good.

--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com
855-TREBLE2 (855-873-2532)   mobile: 512-569-9461
On Sat, Jul 2, 2016 at 12:53 AM, Andres Freund <andres@anarazel.de> wrote:
> On 2016-07-01 15:18:39 -0400, Robert Haas wrote:
>> On Fri, Jul 1, 2016 at 10:22 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>> > Ah, you're right, I misunderstood.
>> >
>> > Attached is an updated patch incorporating your comments.
>> > I've changed it so that heap_xlog_lock clears the vm flags if the page
>> > is marked all-frozen.
>>
>> I believe that this should be separated into two patches, since there
>> are two issues here:
>>
>> 1. Locking a tuple doesn't clear the all-frozen bit, but needs to do so.
>> 2. heap_update releases the buffer content lock without logging the
>>    changes it has made.
>>
>> With respect to #1, there is no need to clear the all-visible bit, only
>> the all-frozen bit. However, that's a bit tricky given that we removed
>> PD_ALL_FROZEN. Should we think about putting that back again?
>
> I think it's fine to just do the vm lookup.
>
>> Should we just clear all-visible and call it good enough?
>
> Given that we need to do that in heap_lock_tuple, which entirely
> preserves all-visible (but shouldn't preserve all-frozen), ISTM we
> better find something that doesn't invalidate all-visible.
>

Sounds logical, considering that we have a way to set all-frozen and
vacuum does that as well. So probably either we need to have a new API
or add a new parameter to visibilitymap_clear() to indicate the same.
If we want to go that route, isn't it better to have PD_ALL_FROZEN as
well?

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
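[For reference, here is a standalone C sketch of what "clear only the all-frozen bit" means under the 9.6 two-bits-per-heap-block map layout. The constants and the HEAPBLK_TO_* macros mirror visibilitymap.h/visibilitymap.c; vm_clear_flags itself is a hypothetical stand-in for the flags-aware visibilitymap_clear() variant being discussed, operating on a raw byte array instead of a buffer page.]

#include <stdint.h>
#include <stdio.h>

#define BITS_PER_HEAPBLOCK        2
#define HEAPBLOCKS_PER_BYTE       4
#define VISIBILITYMAP_ALL_VISIBLE 0x01
#define VISIBILITYMAP_ALL_FROZEN  0x02

#define HEAPBLK_TO_MAPBYTE(x) ((x) / HEAPBLOCKS_PER_BYTE)
#define HEAPBLK_TO_MAPBIT(x)  (((x) % HEAPBLOCKS_PER_BYTE) * BITS_PER_HEAPBLOCK)

/* Hypothetical: clear only the requested flag bits for one heap block. */
static void
vm_clear_flags(uint8_t *map, uint32_t heapBlk, uint8_t flags)
{
    map[HEAPBLK_TO_MAPBYTE(heapBlk)] &= ~(flags << HEAPBLK_TO_MAPBIT(heapBlk));
}

int main(void)
{
    uint8_t  map[2] = {0};
    uint32_t blk = 5;

    /* Mark block 5 all-visible and all-frozen, as vacuum would. */
    map[HEAPBLK_TO_MAPBYTE(blk)] |=
        (VISIBILITYMAP_ALL_VISIBLE | VISIBILITYMAP_ALL_FROZEN) << HEAPBLK_TO_MAPBIT(blk);

    /* Lock-tuple path: drop only all-frozen, keep all-visible. */
    vm_clear_flags(map, blk, VISIBILITYMAP_ALL_FROZEN);

    printf("all-visible: %d, all-frozen: %d\n",
           !!(map[HEAPBLK_TO_MAPBYTE(blk)] & (VISIBILITYMAP_ALL_VISIBLE << HEAPBLK_TO_MAPBIT(blk))),
           !!(map[HEAPBLK_TO_MAPBYTE(blk)] & (VISIBILITYMAP_ALL_FROZEN << HEAPBLK_TO_MAPBIT(blk))));
    return 0;   /* prints: all-visible: 1, all-frozen: 0 */
}

[Because the two flags share a byte, a flags mask on the clear operation is all that a vm-only solution needs; a PD_ALL_FROZEN page bit would only be required if callers had to decide without consulting the map.]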
On Fri, Jul 1, 2016 at 7:52 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> On Fri, Jul 1, 2016 at 11:12 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>>
>> Why do you think an IndexOnlyScan will return a wrong result? If the
>> server crashes in the way you described, the transaction that made the
>> modifications will anyway be considered aborted, so the result of the
>> IndexOnlyScan should not be wrong.
>>
>
> Ah, you're right, I misunderstood.
>
> Attached is an updated patch incorporating your comments.
> I've changed it so that heap_xlog_lock clears the vm flags if the page
> is marked all-frozen.
>

I think we should make a similar change in the heap_lock_tuple API as
well.

Also, currently by default heap_xlog_lock tries to clear the visibility
flags; isn't it better to handle it as we do at all other places (ex.
see log_heap_update), by logging the information about the same? I think
it is always advisable to log every action we want replay to perform.
That way, it is always easy to extend it based on whether some change is
required only in certain cases, but not in others.

Though it is important to get the patch right, I feel in the meantime it
might be better to start benchmarking. AFAIU, even if we change some
part of the information while WAL logging it, the benchmark results
won't be much different.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On Sat, Jul 2, 2016 at 12:34 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Fri, Jul 1, 2016 at 7:52 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>> Ah, you're right, I misunderstood.
>>
>> Attached is an updated patch incorporating your comments.
>> I've changed it so that heap_xlog_lock clears the vm flags if the page
>> is marked all-frozen.
>>
>
> I think we should make a similar change in the heap_lock_tuple API as
> well.
> Also, currently by default heap_xlog_lock tries to clear the visibility
> flags; isn't it better to handle it as we do at all other places (ex.
> see log_heap_update), by logging the information about the same?

I will deal with them.

> Though it is important to get the patch right, I feel in the meantime it
> might be better to start benchmarking. AFAIU, even if we change some
> part of the information while WAL logging it, the benchmark results
> won't be much different.

Okay, I will do the benchmark test as well.

Regards,

--
Masahiko Sawada
On Sat, Jul 2, 2016 at 12:17 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Sat, Jul 2, 2016 at 12:53 AM, Andres Freund <andres@anarazel.de> wrote:
>> On 2016-07-01 15:18:39 -0400, Robert Haas wrote:
>>> I believe that this should be separated into two patches, since there
>>> are two issues here:
>>>
>>> 1. Locking a tuple doesn't clear the all-frozen bit, but needs to do so.
>>> 2. heap_update releases the buffer content lock without logging the
>>>    changes it has made.
>>>
>>> With respect to #1, there is no need to clear the all-visible bit, only
>>> the all-frozen bit. However, that's a bit tricky given that we removed
>>> PD_ALL_FROZEN. Should we think about putting that back again?
>>
>> I think it's fine to just do the vm lookup.
>>
>>> Should we just clear all-visible and call it good enough?
>>
>> Given that we need to do that in heap_lock_tuple, which entirely
>> preserves all-visible (but shouldn't preserve all-frozen), ISTM we
>> better find something that doesn't invalidate all-visible.
>>
>
> Sounds logical, considering that we have a way to set all-frozen and
> vacuum does that as well. So probably either we need to have a new API
> or add a new parameter to visibilitymap_clear() to indicate the same.
> If we want to go that route, isn't it better to have PD_ALL_FROZEN as
> well?
>

Can't we call visibilitymap_set with the all-visible but not the
all-frozen bit, instead of clearing flags?

Regards,

--
Masahiko Sawada
On Mon, Jul 4, 2016 at 2:31 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> On Sat, Jul 2, 2016 at 12:17 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> Sounds logical, considering that we have a way to set all-frozen and
>> vacuum does that as well. So probably either we need to have a new API
>> or add a new parameter to visibilitymap_clear() to indicate the same.
>> If we want to go that route, isn't it better to have PD_ALL_FROZEN as
>> well?
>>
>
> Can't we call visibilitymap_set with the all-visible but not the
> all-frozen bit, instead of clearing flags?
>

That doesn't sound like a good way to deal with it. First,
visibilitymap_set logs the action itself, which would generate two WAL
records (one for the visibility map and another for the lock tuple).
Second, it doesn't look consistent to me.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On Mon, Jul 4, 2016 at 5:44 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> On Sat, Jul 2, 2016 at 12:34 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> I think we should make a similar change in the heap_lock_tuple API as
>> well.
>> Also, currently by default heap_xlog_lock tries to clear the visibility
>> flags; isn't it better to handle it as we do at all other places (ex.
>> see log_heap_update), by logging the information about the same?
>
> I will deal with them.
>
>> Though it is important to get the patch right, I feel in the meantime it
>> might be better to start benchmarking. AFAIU, even if we change some
>> part of the information while WAL logging it, the benchmark results
>> won't be much different.
>
> Okay, I will do the benchmark test as well.
>

I measured the throughput and the output quantity of WAL with HEAD and
HEAD+patch (attached) on my virtual environment.
I used pgbench with the attached custom script file, which sets a
3200-character string in the filler column in order to generate toast
data. The scale factor is 1000 and the pgbench options are -c 4 -T 600
-f toast_test.sql. I changed the filler column to the text data type
before running it.

* Throughput
HEAD    : 1833.204172
Patched : 1827.399482

* Output quantity of WAL
HEAD    : 7771 MB
Patched : 8082 MB

The throughput is almost the same, but the output quantity of WAL
increased slightly (about 4%).

Regards,

--
Masahiko Sawada
On 2016-07-05 23:37:59 +0900, Masahiko Sawada wrote: > diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c > index 57da57a..fd66527 100644 > --- a/src/backend/access/heap/heapam.c > +++ b/src/backend/access/heap/heapam.c > @@ -3923,6 +3923,17 @@ l2: > > if (need_toast || newtupsize > pagefree) > { > + /* > + * To prevent data corruption due to updating old tuple by > + * other backends after released buffer That's not really the reason, is it? The prime problem is crash safety / replication. The row-lock we're faking (by setting xmax to our xid), prevents concurrent updates until this backend died. > , we need to emit that > + * xmax of old tuple is set and clear visibility map bits if > + * needed before releasing buffer. We can reuse xl_heap_lock > + * for this purpose. It should be fine even if we crash midway > + * from this section and the actual updating one later, since > + * the xmax will appear to come from an aborted xid. > + */ > + START_CRIT_SECTION(); > + > /* Clear obsolete visibility flags ... */ > oldtup.t_data->t_infomask &= ~(HEAP_XMAX_BITS | HEAP_MOVED); > oldtup.t_data->t_infomask2 &= ~HEAP_KEYS_UPDATED; > @@ -3936,6 +3947,46 @@ l2: > /* temporarily make it look not-updated */ > oldtup.t_data->t_ctid = oldtup.t_self; > already_marked = true; > + > + /* Clear PD_ALL_VISIBLE flags */ > + if (PageIsAllVisible(BufferGetPage(buffer))) > + { > + all_visible_cleared = true; > + PageClearAllVisible(BufferGetPage(buffer)); > + visibilitymap_clear(relation, BufferGetBlockNumber(buffer), > + vmbuffer); > + } > + > + MarkBufferDirty(buffer); > + > + if (RelationNeedsWAL(relation)) > + { > + xl_heap_lock xlrec; > + XLogRecPtr recptr; > + > + /* > + * For logical decoding we need combocids to properly decode the > + * catalog. > + */ > + if (RelationIsAccessibleInLogicalDecoding(relation)) > + log_heap_new_cid(relation, &oldtup); Hm, I don't see that being necessary here. Row locks aren't logically decoded, so there's no need to emit this here. > + /* Clear PD_ALL_VISIBLE flags */ > + if (PageIsAllVisible(page)) > + { > + Buffer vmbuffer = InvalidBuffer; > + BlockNumber block = BufferGetBlockNumber(*buffer); > + > + all_visible_cleared = true; > + PageClearAllVisible(page); > + visibilitymap_pin(relation, block, &vmbuffer); > + visibilitymap_clear(relation, block, vmbuffer); > + } > + That clears all-visible unnecessarily, we only need to clear all-frozen. > @@ -8694,6 +8761,23 @@ heap_xlog_lock(XLogReaderState *record) > } > HeapTupleHeaderSetXmax(htup, xlrec->locking_xid); > HeapTupleHeaderSetCmax(htup, FirstCommandId, false); > + > + /* The visibility map need to be cleared */ > + if ((xlrec->infobits_set & XLHL_ALL_VISIBLE_CLEARED) != 0) > + { > + RelFileNode rnode; > + Buffer vmbuffer = InvalidBuffer; > + BlockNumber blkno; > + Relation reln; > + > + XLogRecGetBlockTag(record, 0, &rnode, NULL, &blkno); > + reln = CreateFakeRelcacheEntry(rnode); > + > + visibilitymap_pin(reln, blkno, &vmbuffer); > + visibilitymap_clear(reln, blkno, vmbuffer); > + PageClearAllVisible(page); > + } > + > PageSetLSN(page, lsn); > MarkBufferDirty(buffer); > } > diff --git a/src/include/access/heapam_xlog.h b/src/include/access/heapam_xlog.h > index a822d0b..41b3c54 100644 > --- a/src/include/access/heapam_xlog.h > +++ b/src/include/access/heapam_xlog.h > @@ -242,6 +242,7 @@ typedef struct xl_heap_cleanup_info > #define XLHL_XMAX_EXCL_LOCK 0x04 > #define XLHL_XMAX_KEYSHR_LOCK 0x08 > #define XLHL_KEYS_UPDATED 0x10 > +#define XLHL_ALL_VISIBLE_CLEARED 0x20 Hm. 
We can't easily do that in the back-patched version, because a standby won't know to check for the flag. That's kinda ok, since we don't need to clear all-visible yet at that point of heap_update. But that means we'd better not do so on the master either. Greetings, Andres Freund
On Thu, Jul 7, 2016 at 3:36 AM, Andres Freund <andres@anarazel.de> wrote: > On 2016-07-05 23:37:59 +0900, Masahiko Sawada wrote: > >> @@ -8694,6 +8761,23 @@ heap_xlog_lock(XLogReaderState *record) >> } >> HeapTupleHeaderSetXmax(htup, xlrec->locking_xid); >> HeapTupleHeaderSetCmax(htup, FirstCommandId, false); >> + >> + /* The visibility map need to be cleared */ >> + if ((xlrec->infobits_set & XLHL_ALL_VISIBLE_CLEARED) != 0) >> + { >> + RelFileNode rnode; >> + Buffer vmbuffer = InvalidBuffer; >> + BlockNumber blkno; >> + Relation reln; >> + >> + XLogRecGetBlockTag(record, 0, &rnode, NULL, &blkno); >> + reln = CreateFakeRelcacheEntry(rnode); >> + >> + visibilitymap_pin(reln, blkno, &vmbuffer); >> + visibilitymap_clear(reln, blkno, vmbuffer); >> + PageClearAllVisible(page); >> + } >> + > > >> PageSetLSN(page, lsn); >> MarkBufferDirty(buffer); >> } >> diff --git a/src/include/access/heapam_xlog.h b/src/include/access/heapam_xlog.h >> index a822d0b..41b3c54 100644 >> --- a/src/include/access/heapam_xlog.h >> +++ b/src/include/access/heapam_xlog.h >> @@ -242,6 +242,7 @@ typedef struct xl_heap_cleanup_info >> #define XLHL_XMAX_EXCL_LOCK 0x04 >> #define XLHL_XMAX_KEYSHR_LOCK 0x08 >> #define XLHL_KEYS_UPDATED 0x10 >> +#define XLHL_ALL_VISIBLE_CLEARED 0x20 > > Hm. We can't easily do that in the back-patched version; because a > standby won't know to check for the flag . That's kinda ok, since we > don't yet need to clear all-visible yet at that point of > heap_update. But that better means we don't do so on the master either. > To clarify, do you mean to say that we should have XLHL_ALL_FROZEN_CLEARED and do that just for master, and that for back-branches there is no need to clear any visibility flags? -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Thank you for reviewing! On Thu, Jul 7, 2016 at 7:06 AM, Andres Freund <andres@anarazel.de> wrote: > On 2016-07-05 23:37:59 +0900, Masahiko Sawada wrote: >> diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c >> index 57da57a..fd66527 100644 >> --- a/src/backend/access/heap/heapam.c >> +++ b/src/backend/access/heap/heapam.c >> @@ -3923,6 +3923,17 @@ l2: >> >> if (need_toast || newtupsize > pagefree) >> { >> + /* >> + * To prevent data corruption due to updating old tuple by >> + * other backends after released buffer > > That's not really the reason, is it? The prime problem is crash safety / > replication. The row-lock we're faking (by setting xmax to our xid), > prevents concurrent updates until this backend died. Fixed. >> , we need to emit that >> + * xmax of old tuple is set and clear visibility map bits if >> + * needed before releasing buffer. We can reuse xl_heap_lock >> + * for this purpose. It should be fine even if we crash midway >> + * from this section and the actual updating one later, since >> + * the xmax will appear to come from an aborted xid. >> + */ >> + START_CRIT_SECTION(); >> + >> /* Clear obsolete visibility flags ... */ >> oldtup.t_data->t_infomask &= ~(HEAP_XMAX_BITS | HEAP_MOVED); >> oldtup.t_data->t_infomask2 &= ~HEAP_KEYS_UPDATED; >> @@ -3936,6 +3947,46 @@ l2: >> /* temporarily make it look not-updated */ >> oldtup.t_data->t_ctid = oldtup.t_self; >> already_marked = true; >> + >> + /* Clear PD_ALL_VISIBLE flags */ >> + if (PageIsAllVisible(BufferGetPage(buffer))) >> + { >> + all_visible_cleared = true; >> + PageClearAllVisible(BufferGetPage(buffer)); >> + visibilitymap_clear(relation, BufferGetBlockNumber(buffer), >> + vmbuffer); >> + } >> + >> + MarkBufferDirty(buffer); >> + >> + if (RelationNeedsWAL(relation)) >> + { >> + xl_heap_lock xlrec; >> + XLogRecPtr recptr; >> + >> + /* >> + * For logical decoding we need combocids to properly decode the >> + * catalog. >> + */ >> + if (RelationIsAccessibleInLogicalDecoding(relation)) >> + log_heap_new_cid(relation, &oldtup); > > Hm, I don't see that being necessary here. Row locks aren't logically > decoded, so there's no need to emit this here. Fixed. > >> + /* Clear PD_ALL_VISIBLE flags */ >> + if (PageIsAllVisible(page)) >> + { >> + Buffer vmbuffer = InvalidBuffer; >> + BlockNumber block = BufferGetBlockNumber(*buffer); >> + >> + all_visible_cleared = true; >> + PageClearAllVisible(page); >> + visibilitymap_pin(relation, block, &vmbuffer); >> + visibilitymap_clear(relation, block, vmbuffer); >> + } >> + > > That clears all-visible unnecessarily, we only need to clear all-frozen. > Fixed.
> >> @@ -8694,6 +8761,23 @@ heap_xlog_lock(XLogReaderState *record) >> } >> HeapTupleHeaderSetXmax(htup, xlrec->locking_xid); >> HeapTupleHeaderSetCmax(htup, FirstCommandId, false); >> + >> + /* The visibility map need to be cleared */ >> + if ((xlrec->infobits_set & XLHL_ALL_VISIBLE_CLEARED) != 0) >> + { >> + RelFileNode rnode; >> + Buffer vmbuffer = InvalidBuffer; >> + BlockNumber blkno; >> + Relation reln; >> + >> + XLogRecGetBlockTag(record, 0, &rnode, NULL, &blkno); >> + reln = CreateFakeRelcacheEntry(rnode); >> + >> + visibilitymap_pin(reln, blkno, &vmbuffer); >> + visibilitymap_clear(reln, blkno, vmbuffer); >> + PageClearAllVisible(page); >> + } >> + > > >> PageSetLSN(page, lsn); >> MarkBufferDirty(buffer); >> } >> diff --git a/src/include/access/heapam_xlog.h b/src/include/access/heapam_xlog.h >> index a822d0b..41b3c54 100644 >> --- a/src/include/access/heapam_xlog.h >> +++ b/src/include/access/heapam_xlog.h >> @@ -242,6 +242,7 @@ typedef struct xl_heap_cleanup_info >> #define XLHL_XMAX_EXCL_LOCK 0x04 >> #define XLHL_XMAX_KEYSHR_LOCK 0x08 >> #define XLHL_KEYS_UPDATED 0x10 >> +#define XLHL_ALL_VISIBLE_CLEARED 0x20 > > Hm. We can't easily do that in the back-patched version; because a > standby won't know to check for the flag . That's kinda ok, since we > don't yet need to clear all-visible yet at that point of > heap_update. But that better means we don't do so on the master either. > Attached is the latest version of the patch. I changed the visibilitymap_clear function so that it allows specifying the bits to be cleared. A function that needs to clear only the all-frozen bit on the visibility map calls the visibilitymap_clear_extended function to clear that particular bit; other functions can call visibilitymap_clear to clear all bits for one page. Instead of adding XLHL_ALL_VISIBLE_CLEARED, we do a visibility map lookup for back branches. To reduce unnecessary visibility map lookups, I changed it so that we check PD_ALL_VISIBLE on the heap page first, and then look up the all-frozen bit on the visibility map if necessary. Regards, -- Masahiko Sawada
Attachment
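A minimal sketch of the API split described above, with illustrative signatures (guessed from the description and the existing VISIBILITYMAP_* flag bits; the attached patch may differ in detail):

    /* Sketch: clear only the given VISIBILITYMAP_* bits for one heap page. */
    extern void visibilitymap_clear_extended(Relation rel, BlockNumber heapBlk,
                                             Buffer vmbuf, uint8 flags);

    /* Existing callers keep clearing both bits for the page. */
    static inline void
    visibilitymap_clear(Relation rel, BlockNumber heapBlk, Buffer vmbuf)
    {
        visibilitymap_clear_extended(rel, heapBlk, vmbuf,
                                     VISIBILITYMAP_VALID_BITS);
    }

    /* A caller that only wants to drop the all-frozen bit would then do: */
    /* visibilitymap_clear_extended(rel, blk, vmbuf, VISIBILITYMAP_ALL_FROZEN); */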
On Wed, Jul 6, 2016 at 6:06 PM, Andres Freund <andres@anarazel.de> wrote: > Hm. We can't easily do that in the back-patched version; because a > standby won't know to check for the flag . That's kinda ok, since we > don't yet need to clear all-visible yet at that point of > heap_update. But that better means we don't do so on the master either. Is there any reason to back-patch this in the first place? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Robert Haas wrote: > On Wed, Jul 6, 2016 at 6:06 PM, Andres Freund <andres@anarazel.de> wrote: > > Hm. We can't easily do that in the back-patched version; because a > > standby won't know to check for the flag . That's kinda ok, since we > > don't yet need to clear all-visible yet at that point of > > heap_update. But that better means we don't do so on the master either. > > Is there any reason to back-patch this in the first place? Wasn't this determined to be a pre-existing bug? I think the probability of occurrence has increased, but it's still possible in earlier releases. I wonder if there are unexplained bugs that can be traced down to this. I'm not really following this (sorry about that) but I wonder if (in back branches) the failure to propagate in case the standby wasn't updated can cause actual problems. If it does, maybe it'd be a better idea to have a new WAL record type instead of piggybacking on lock tuple. Then again, apparently the probability of this bug is low enough that we shouldn't sweat over it ... More so considering Robert's apparent opinion that perhaps we shouldn't backpatch at all in the first place. In any case I would like to see much more commentary in the patch next to the new XLHL flag. It's sufficiently different from the rest that it deserves it, IMO. -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Thu, Jul 7, 2016 at 10:53 AM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote: > Robert Haas wrote: >> On Wed, Jul 6, 2016 at 6:06 PM, Andres Freund <andres@anarazel.de> wrote: >> > Hm. We can't easily do that in the back-patched version; because a >> > standby won't know to check for the flag . That's kinda ok, since we >> > don't yet need to clear all-visible yet at that point of >> > heap_update. But that better means we don't do so on the master either. >> >> Is there any reason to back-patch this in the first place? > > Wasn't this determined to be a pre-existing bug? I think the > probability of occurrence has increased, but it's still possible in > earlier releases. I wonder if there are unexplained bugs that can be > traced down to this. > > I'm not really following this (sorry about that) but I wonder if (in > back branches) the failure to propagate in case the standby wasn't > updated can cause actual problems. If it does, maybe it'd be a better > idea to have a new WAL record type instead of piggybacking on lock > tuple. Then again, apparently the probability of this bug is low enough > that we shouldn't sweat over it ... Moreso considering Robert's apparent > opinion that perhaps we shouldn't backpatch at all in the first place. > > In any case I would like to see much more commentary in the patch next > to the new XLHL flag. It's sufficiently different than the rest than it > deserves so, IMO. There are two issues being discussed on this thread. One of them is a new issue in 9.6: heap_lock_tuple needs to clear the all-frozen bit in the freeze map even though it does not clear all-visible. The one that's actually a preexisting bug is that we can start to update a tuple without WAL-logging anything and then release the page lock in order to go perform TOAST insertions. At this point, other backends (on the master) will see this tuple as in the process of being updated because xmax has been set and ctid has been made to point back to the same tuple. I'm guessing that if the UPDATE goes on to complete, any discrepancy between the master and the standby is erased by the replay of the WAL record covering the update itself. I haven't checked that, but it seems like that WAL record must set both xmax and ctid appropriately or we'd be in big trouble. The infomask bits are in play too, but presumably the update's WAL is going to set those correctly also. So in this case I don't think there's really any issue for the standby. Or for the master, either: it may technically be true the tuple is not all-visible any more, but the only backend that could potentially fail to see it is the one performing the update, and no user code can run in the middle of toast_insert_or_update, so I think we're OK. On the other hand, if the UPDATE aborts, there's now a persistent difference between the master and standby: the infomask, xmax, and ctid of the tuple may differ. I don't know whether that could cause any problem. It's probably a very rare case, because there aren't all that many things that will cause us to abort in the middle of inserting TOAST tuples. Out of disk space comes to mind, or maybe some kind of corruption that throws an elog(). As far as back-patching goes, the question is whether it's worth the risk. Introducing new WAL logging at this point could certainly cause performance problems if nothing else, never mind the risk of garden-variety bugs. 
I'm not sure it's worth taking that risk in released branches for the sake of a bug which has existed for a decade without anybody finding it until now. I'm not going to argue strongly for that position, but I think it's worth thinking about. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 2016-07-07 10:37:15 -0400, Robert Haas wrote: > On Wed, Jul 6, 2016 at 6:06 PM, Andres Freund <andres@anarazel.de> wrote: > > Hm. We can't easily do that in the back-patched version; because a > > standby won't know to check for the flag . That's kinda ok, since we > > don't yet need to clear all-visible yet at that point of > > heap_update. But that better means we don't do so on the master either. > > Is there any reason to back-patch this in the first place? It seems not unlikely that this has caused corruption in the past, and that we chalked it up to hardware corruption or something. Both toasting and file extension frequently take extended amounts of time under load; the window for crashing at the wrong moment isn't small... Andres
On Thu, Jul 7, 2016 at 1:58 PM, Andres Freund <andres@anarazel.de> wrote: > On 2016-07-07 10:37:15 -0400, Robert Haas wrote: >> On Wed, Jul 6, 2016 at 6:06 PM, Andres Freund <andres@anarazel.de> wrote: >> > Hm. We can't easily do that in the back-patched version; because a >> > standby won't know to check for the flag . That's kinda ok, since we >> > don't yet need to clear all-visible yet at that point of >> > heap_update. But that better means we don't do so on the master either. >> >> Is there any reason to back-patch this in the first place? > > It seems not unlikely that this has caused corruption in the past; and > that we chalked it up to hardware corruption or something. Both toasting > and file extension frequently take extended amounts of time under load, > the window for crashing in the wrong moment isn't small... Yeah, that's true, but I'm having a bit of trouble imagining exactly how we end up with corruption that actually matters. I guess a torn page could do it. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 2016-07-07 14:01:05 -0400, Robert Haas wrote: > On Thu, Jul 7, 2016 at 1:58 PM, Andres Freund <andres@anarazel.de> wrote: > > On 2016-07-07 10:37:15 -0400, Robert Haas wrote: > >> On Wed, Jul 6, 2016 at 6:06 PM, Andres Freund <andres@anarazel.de> wrote: > >> > Hm. We can't easily do that in the back-patched version; because a > >> > standby won't know to check for the flag . That's kinda ok, since we > >> > don't yet need to clear all-visible yet at that point of > >> > heap_update. But that better means we don't do so on the master either. > >> > >> Is there any reason to back-patch this in the first place? > > > > It seems not unlikely that this has caused corruption in the past; and > > that we chalked it up to hardware corruption or something. Both toasting > > and file extension frequently take extended amounts of time under load, > > the window for crashing in the wrong moment isn't small... > > Yeah, that's true, but I'm having a bit of trouble imagining exactly > we end up with corruption that actually matters. I guess a torn page > could do it. I think Noah pointed out a bad scenario: If we crash after putting the xid in the page header, but before WAL logging, the xid could get reused after the crash. By a different transaction. And suddenly the row isn't visible anymore, after the reused xid commits...
On Thu, Jul 7, 2016 at 2:04 PM, Andres Freund <andres@anarazel.de> wrote: >> Yeah, that's true, but I'm having a bit of trouble imagining exactly >> we end up with corruption that actually matters. I guess a torn page >> could do it. > > I think Noah pointed out a bad scenario: If we crash after putting the > xid in the page header, but before WAL logging, the xid could get reused > after the crash. By a different transaction. And suddenly the row isn't > visible anymore, after the reused xid commits... Oh, wow. Yikes. OK, so I guess we should try to back-patch the fix, then. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Thu, Jul 7, 2016 at 12:34 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote: > Than you for reviewing! > > On Thu, Jul 7, 2016 at 7:06 AM, Andres Freund <andres@anarazel.de> wrote: >> On 2016-07-05 23:37:59 +0900, Masahiko Sawada wrote: >>> diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c >>> index 57da57a..fd66527 100644 >>> --- a/src/backend/access/heap/heapam.c >>> +++ b/src/backend/access/heap/heapam.c >>> @@ -3923,6 +3923,17 @@ l2: >>> >>> if (need_toast || newtupsize > pagefree) >>> { >>> + /* >>> + * To prevent data corruption due to updating old tuple by >>> + * other backends after released buffer >> >> That's not really the reason, is it? The prime problem is crash safety / >> replication. The row-lock we're faking (by setting xmax to our xid), >> prevents concurrent updates until this backend died. > > Fixed. > >>> , we need to emit that >>> + * xmax of old tuple is set and clear visibility map bits if >>> + * needed before releasing buffer. We can reuse xl_heap_lock >>> + * for this purpose. It should be fine even if we crash midway >>> + * from this section and the actual updating one later, since >>> + * the xmax will appear to come from an aborted xid. >>> + */ >>> + START_CRIT_SECTION(); >>> + >>> /* Clear obsolete visibility flags ... */ >>> oldtup.t_data->t_infomask &= ~(HEAP_XMAX_BITS | HEAP_MOVED); >>> oldtup.t_data->t_infomask2 &= ~HEAP_KEYS_UPDATED; >>> @@ -3936,6 +3947,46 @@ l2: >>> /* temporarily make it look not-updated */ >>> oldtup.t_data->t_ctid = oldtup.t_self; >>> already_marked = true; >>> + >>> + /* Clear PD_ALL_VISIBLE flags */ >>> + if (PageIsAllVisible(BufferGetPage(buffer))) >>> + { >>> + all_visible_cleared = true; >>> + PageClearAllVisible(BufferGetPage(buffer)); >>> + visibilitymap_clear(relation, BufferGetBlockNumber(buffer), >>> + vmbuffer); >>> + } >>> + >>> + MarkBufferDirty(buffer); >>> + >>> + if (RelationNeedsWAL(relation)) >>> + { >>> + xl_heap_lock xlrec; >>> + XLogRecPtr recptr; >>> + >>> + /* >>> + * For logical decoding we need combocids to properly decode the >>> + * catalog. >>> + */ >>> + if (RelationIsAccessibleInLogicalDecoding(relation)) >>> + log_heap_new_cid(relation, &oldtup); >> >> Hm, I don't see that being necessary here. Row locks aren't logically >> decoded, so there's no need to emit this here. > > Fixed. > >> >>> + /* Clear PD_ALL_VISIBLE flags */ >>> + if (PageIsAllVisible(page)) >>> + { >>> + Buffer vmbuffer = InvalidBuffer; >>> + BlockNumber block = BufferGetBlockNumber(*buffer); >>> + >>> + all_visible_cleared = true; >>> + PageClearAllVisible(page); >>> + visibilitymap_pin(relation, block, &vmbuffer); >>> + visibilitymap_clear(relation, block, vmbuffer); >>> + } >>> + >> >> That clears all-visible unnecessarily, we only need to clear all-frozen. >> > > Fixed. 
> >> >>> @@ -8694,6 +8761,23 @@ heap_xlog_lock(XLogReaderState *record) >>> } >>> HeapTupleHeaderSetXmax(htup, xlrec->locking_xid); >>> HeapTupleHeaderSetCmax(htup, FirstCommandId, false); >>> + >>> + /* The visibility map need to be cleared */ >>> + if ((xlrec->infobits_set & XLHL_ALL_VISIBLE_CLEARED) != 0) >>> + { >>> + RelFileNode rnode; >>> + Buffer vmbuffer = InvalidBuffer; >>> + BlockNumber blkno; >>> + Relation reln; >>> + >>> + XLogRecGetBlockTag(record, 0, &rnode, NULL, &blkno); >>> + reln = CreateFakeRelcacheEntry(rnode); >>> + >>> + visibilitymap_pin(reln, blkno, &vmbuffer); >>> + visibilitymap_clear(reln, blkno, vmbuffer); >>> + PageClearAllVisible(page); >>> + } >>> + >> >> >>> PageSetLSN(page, lsn); >>> MarkBufferDirty(buffer); >>> } >>> diff --git a/src/include/access/heapam_xlog.h b/src/include/access/heapam_xlog.h >>> index a822d0b..41b3c54 100644 >>> --- a/src/include/access/heapam_xlog.h >>> +++ b/src/include/access/heapam_xlog.h >>> @@ -242,6 +242,7 @@ typedef struct xl_heap_cleanup_info >>> #define XLHL_XMAX_EXCL_LOCK 0x04 >>> #define XLHL_XMAX_KEYSHR_LOCK 0x08 >>> #define XLHL_KEYS_UPDATED 0x10 >>> +#define XLHL_ALL_VISIBLE_CLEARED 0x20 >> >> Hm. We can't easily do that in the back-patched version; because a >> standby won't know to check for the flag . That's kinda ok, since we >> don't yet need to clear all-visible yet at that point of >> heap_update. But that better means we don't do so on the master either. >> > > Attached latest version patch. + /* Clear only the all-frozen bit on visibility map if needed */ + if (PageIsAllVisible(BufferGetPage(buffer)) && + VM_ALL_FROZEN(relation, block, &vmbuffer)) + { + visibilitymap_clear_extended(relation, block, vmbuffer, + VISIBILITYMAP_ALL_FROZEN); + } + + if (RelationNeedsWAL(relation)) + { .. + XLogBeginInsert(); + XLogRegisterBuffer(0, buffer, REGBUF_STANDARD); + + xlrec.offnum = ItemPointerGetOffsetNumber(&oldtup.t_self); + xlrec.locking_xid = xmax_old_tuple; + xlrec.infobits_set = compute_infobits(oldtup.t_data->t_infomask, + oldtup.t_data->t_infomask2); + XLogRegisterData((char *) &xlrec, SizeOfHeapLock); + recptr = XLogInsert(RM_HEAP_ID, XLOG_HEAP_LOCK); .. One thing that looks awkward in this code is that it doesn't record whether the frozen bit was actually cleared during the actual operation; then, during replay, it always clears the frozen bit, irrespective of whether it was cleared by the actual operation or not. + /* Clear only the all-frozen bit on visibility map if needed */ + if (PageIsAllVisible(page) && + VM_ALL_FROZEN(relation, BufferGetBlockNumber(*buffer), &vmbuffer)) + { + BlockNumber block = BufferGetBlockNumber(*buffer); + + visibilitymap_pin(relation, block, &vmbuffer); I think it is not right to call visibilitymap_pin after holding a buffer lock (visibilitymap_pin can perform I/O). Refer to heap_update for how to pin the visibility map. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
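For reference, the heap_update() pattern Amit points to is roughly the following (paraphrased from the 9.6-era sources): pin first, lock second, so the possible I/O in visibilitymap_pin() never happens while the heap buffer content lock is held.

    buffer = ReadBuffer(relation, block);
    page = BufferGetPage(buffer);

    /*
     * Pin the visibility map page before taking the buffer content lock;
     * visibilitymap_pin() may need to read the vm page in from disk.
     */
    if (PageIsAllVisible(page))
        visibilitymap_pin(relation, block, &vmbuffer);

    LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);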
On Fri, Jul 8, 2016 at 10:24 PM, Amit Kapila <amit.kapila16@gmail.com> wrote: > On Thu, Jul 7, 2016 at 12:34 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote: >> Than you for reviewing! >> >> On Thu, Jul 7, 2016 at 7:06 AM, Andres Freund <andres@anarazel.de> wrote: >>> On 2016-07-05 23:37:59 +0900, Masahiko Sawada wrote: >>>> diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c >>>> index 57da57a..fd66527 100644 >>>> --- a/src/backend/access/heap/heapam.c >>>> +++ b/src/backend/access/heap/heapam.c >>>> @@ -3923,6 +3923,17 @@ l2: >>>> >>>> if (need_toast || newtupsize > pagefree) >>>> { >>>> + /* >>>> + * To prevent data corruption due to updating old tuple by >>>> + * other backends after released buffer >>> >>> That's not really the reason, is it? The prime problem is crash safety / >>> replication. The row-lock we're faking (by setting xmax to our xid), >>> prevents concurrent updates until this backend died. >> >> Fixed. >> >>>> , we need to emit that >>>> + * xmax of old tuple is set and clear visibility map bits if >>>> + * needed before releasing buffer. We can reuse xl_heap_lock >>>> + * for this purpose. It should be fine even if we crash midway >>>> + * from this section and the actual updating one later, since >>>> + * the xmax will appear to come from an aborted xid. >>>> + */ >>>> + START_CRIT_SECTION(); >>>> + >>>> /* Clear obsolete visibility flags ... */ >>>> oldtup.t_data->t_infomask &= ~(HEAP_XMAX_BITS | HEAP_MOVED); >>>> oldtup.t_data->t_infomask2 &= ~HEAP_KEYS_UPDATED; >>>> @@ -3936,6 +3947,46 @@ l2: >>>> /* temporarily make it look not-updated */ >>>> oldtup.t_data->t_ctid = oldtup.t_self; >>>> already_marked = true; >>>> + >>>> + /* Clear PD_ALL_VISIBLE flags */ >>>> + if (PageIsAllVisible(BufferGetPage(buffer))) >>>> + { >>>> + all_visible_cleared = true; >>>> + PageClearAllVisible(BufferGetPage(buffer)); >>>> + visibilitymap_clear(relation, BufferGetBlockNumber(buffer), >>>> + vmbuffer); >>>> + } >>>> + >>>> + MarkBufferDirty(buffer); >>>> + >>>> + if (RelationNeedsWAL(relation)) >>>> + { >>>> + xl_heap_lock xlrec; >>>> + XLogRecPtr recptr; >>>> + >>>> + /* >>>> + * For logical decoding we need combocids to properly decode the >>>> + * catalog. >>>> + */ >>>> + if (RelationIsAccessibleInLogicalDecoding(relation)) >>>> + log_heap_new_cid(relation, &oldtup); >>> >>> Hm, I don't see that being necessary here. Row locks aren't logically >>> decoded, so there's no need to emit this here. >> >> Fixed. >> >>> >>>> + /* Clear PD_ALL_VISIBLE flags */ >>>> + if (PageIsAllVisible(page)) >>>> + { >>>> + Buffer vmbuffer = InvalidBuffer; >>>> + BlockNumber block = BufferGetBlockNumber(*buffer); >>>> + >>>> + all_visible_cleared = true; >>>> + PageClearAllVisible(page); >>>> + visibilitymap_pin(relation, block, &vmbuffer); >>>> + visibilitymap_clear(relation, block, vmbuffer); >>>> + } >>>> + >>> >>> That clears all-visible unnecessarily, we only need to clear all-frozen. >>> >> >> Fixed. 
>> >>> >>>> @@ -8694,6 +8761,23 @@ heap_xlog_lock(XLogReaderState *record) >>>> } >>>> HeapTupleHeaderSetXmax(htup, xlrec->locking_xid); >>>> HeapTupleHeaderSetCmax(htup, FirstCommandId, false); >>>> + >>>> + /* The visibility map need to be cleared */ >>>> + if ((xlrec->infobits_set & XLHL_ALL_VISIBLE_CLEARED) != 0) >>>> + { >>>> + RelFileNode rnode; >>>> + Buffer vmbuffer = InvalidBuffer; >>>> + BlockNumber blkno; >>>> + Relation reln; >>>> + >>>> + XLogRecGetBlockTag(record, 0, &rnode, NULL, &blkno); >>>> + reln = CreateFakeRelcacheEntry(rnode); >>>> + >>>> + visibilitymap_pin(reln, blkno, &vmbuffer); >>>> + visibilitymap_clear(reln, blkno, vmbuffer); >>>> + PageClearAllVisible(page); >>>> + } >>>> + >>> >>> >>>> PageSetLSN(page, lsn); >>>> MarkBufferDirty(buffer); >>>> } >>>> diff --git a/src/include/access/heapam_xlog.h b/src/include/access/heapam_xlog.h >>>> index a822d0b..41b3c54 100644 >>>> --- a/src/include/access/heapam_xlog.h >>>> +++ b/src/include/access/heapam_xlog.h >>>> @@ -242,6 +242,7 @@ typedef struct xl_heap_cleanup_info >>>> #define XLHL_XMAX_EXCL_LOCK 0x04 >>>> #define XLHL_XMAX_KEYSHR_LOCK 0x08 >>>> #define XLHL_KEYS_UPDATED 0x10 >>>> +#define XLHL_ALL_VISIBLE_CLEARED 0x20 >>> >>> Hm. We can't easily do that in the back-patched version; because a >>> standby won't know to check for the flag . That's kinda ok, since we >>> don't yet need to clear all-visible yet at that point of >>> heap_update. But that better means we don't do so on the master either. >>> >> >> Attached latest version patch. > > + /* Clear only the all-frozen bit on visibility map if needed */ > > + if (PageIsAllVisible(BufferGetPage(buffer)) && > > + VM_ALL_FROZEN(relation, block, &vmbuffer)) > + { > + visibilitymap_clear_extended(relation, block, vmbuffer, > + VISIBILITYMAP_ALL_FROZEN); > + } > + > > + if (RelationNeedsWAL(relation)) > + { > .. > > + XLogBeginInsert(); > + XLogRegisterBuffer(0, buffer, REGBUF_STANDARD); > + > + xlrec.offnum = ItemPointerGetOffsetNumber(&oldtup.t_self); > + xlrec.locking_xid = xmax_old_tuple; > + xlrec.infobits_set = compute_infobits(oldtup.t_data->t_infomask, > + oldtup.t_data->t_infomask2); > + XLogRegisterData((char *) &xlrec, SizeOfHeapLock); > + recptr = XLogInsert(RM_HEAP_ID, XLOG_HEAP_LOCK); > .. > > One thing that looks awkward in this code is that it doesn't record > whether the frozen bit is actually cleared during the actual operation > and then during replay, it always clear the frozen bit irrespective of > whether it has been cleared by the actual operation or not. > I changed it so that we look the all-frozen bit up first, and then clear it if needed. > + /* Clear only the all-frozen bit on visibility map if needed */ > + if (PageIsAllVisible(page) && > + VM_ALL_FROZEN(relation, BufferGetBlockNumber(*buffer), &vmbuffer)) > + { > + BlockNumber block = BufferGetBlockNumber(*buffer); > + > + visibilitymap_pin(relation, block, &vmbuffer); > > I think it is not right to call visibilitymap_pin after holding a > buffer lock (visibilitymap_pin can perform I/O). Refer heap_update > for how to pin the visibility map. > Thank you for your advice! Fixed. Attached are the two separated patches; please give me feedback. Regards, -- Masahiko Sawada
Attachment
Hi, So I'm generally happy with 0001, barring some relatively minor adjustments. I am however wondering about one thing: On 2016-07-11 23:51:05 +0900, Masahiko Sawada wrote: > diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c > index 57da57a..e7cb8ca 100644 > --- a/src/backend/access/heap/heapam.c > +++ b/src/backend/access/heap/heapam.c > @@ -3923,6 +3923,16 @@ l2: > > if (need_toast || newtupsize > pagefree) > { > + /* > + * For crash safety, we need to emit that xmax of old tuple is set > + * and clear only the all-frozen bit on visibility map if needed > + * before releasing the buffer. We can reuse xl_heap_lock for this > + * purpose. It should be fine even if we crash midway from this > + * section and the actual updating one later, since the xmax will > + * appear to come from an aborted xid. > + */ > + START_CRIT_SECTION(); > + > /* Clear obsolete visibility flags ... */ > oldtup.t_data->t_infomask &= ~(HEAP_XMAX_BITS | HEAP_MOVED); > oldtup.t_data->t_infomask2 &= ~HEAP_KEYS_UPDATED; > @@ -3936,6 +3946,28 @@ l2: > /* temporarily make it look not-updated */ > oldtup.t_data->t_ctid = oldtup.t_self; > already_marked = true; > + > + MarkBufferDirty(buffer); > + > + if (RelationNeedsWAL(relation)) > + { > + xl_heap_lock xlrec; > + XLogRecPtr recptr; > + > + XLogBeginInsert(); > + XLogRegisterBuffer(0, buffer, REGBUF_STANDARD); > + > + xlrec.offnum = ItemPointerGetOffsetNumber(&oldtup.t_self); > + xlrec.locking_xid = xmax_old_tuple; > + xlrec.infobits_set = compute_infobits(oldtup.t_data->t_infomask, > + oldtup.t_data->t_infomask2); > + XLogRegisterData((char *) &xlrec, SizeOfHeapLock); > + recptr = XLogInsert(RM_HEAP_ID, XLOG_HEAP_LOCK); > + PageSetLSN(page, recptr); > + } Master does /* temporarily make it look not-updated */ oldtup.t_data->t_ctid = oldtup.t_self; here, and as is, the WAL record won't reflect that, because: static void heap_xlog_lock(XLogReaderState *record) { ... /* * Clear relevant update flags, but only if the modified infomask says * there's no update. */ if (HEAP_XMAX_IS_LOCKED_ONLY(htup->t_infomask)) { HeapTupleHeaderClearHotUpdated(htup); /* Make sure there is no forward chain link in t_ctid */ ItemPointerSet(&htup->t_ctid, BufferGetBlockNumber(buffer), offnum); } won't enter the branch, because HEAP_XMAX_LOCK_ONLY won't be set. Which will leave t_ctid and HEAP_HOT_UPDATED set differently on the master and standby / after crash recovery. I'm failing to see any harmful consequences right now, but differences between master and standby are a bad thing. Pre-9.3 that's not a problem; we reset ctid and HOT_UPDATED unconditionally there. I think I'm more comfortable with setting HEAP_XMAX_LOCK_ONLY until the tuple is finally updated - that also coincides more closely with the actual meaning. Any arguments against? > > + /* Clear only the all-frozen bit on visibility map if needed */ > + if (PageIsAllVisible(BufferGetPage(buffer)) && > + VM_ALL_FROZEN(relation, block, &vmbuffer)) > + { > + visibilitymap_clear_extended(relation, block, vmbuffer, > + VISIBILITYMAP_ALL_FROZEN); > + } > + FWIW, I don't think it's worth introducing visibilitymap_clear_extended. As this is a 9.6-only patch, I think it's better to change visibilitymap_clear's API. Unless somebody protests I'm planning to commit with those adjustments tomorrow. Greetings, Andres Freund
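As an illustration of that last proposal (a sketch only, not the committed change), the interim state in heap_update()'s TOAST path would be stamped as a pure row lock rather than a half-done update, so that heap_xlog_lock()'s LOCKED_ONLY branch applies on replay:

    /* Temporarily make the old tuple look merely locked, not updated. */
    oldtup.t_data->t_infomask &= ~(HEAP_XMAX_BITS | HEAP_MOVED);
    oldtup.t_data->t_infomask2 &= ~HEAP_KEYS_UPDATED;
    HeapTupleClearHotUpdated(&oldtup);
    HeapTupleHeaderSetXmax(oldtup.t_data, xmax_old_tuple);
    /* LOCK_ONLY also keeps ctid-chasing loops from following the chain */
    oldtup.t_data->t_infomask |= HEAP_XMAX_LOCK_ONLY | HEAP_XMAX_EXCL_LOCK;
    /* no forward chain link while the tuple is only locked */
    oldtup.t_data->t_ctid = oldtup.t_self;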
On Thu, Jul 14, 2016 at 11:36 AM, Andres Freund <andres@anarazel.de> wrote: > Hi, > > Master does > /* temporarily make it look not-updated */ > oldtup.t_data->t_ctid = oldtup.t_self; > here, and as is the wal record won't reflect that, because: > static void > heap_xlog_lock(XLogReaderState *record) > { > ... > /* > * Clear relevant update flags, but only if the modified infomask says > * there's no update. > */ > if (HEAP_XMAX_IS_LOCKED_ONLY(htup->t_infomask)) > { > HeapTupleHeaderClearHotUpdated(htup); > /* Make sure there is no forward chain link in t_ctid */ > ItemPointerSet(&htup->t_ctid, > BufferGetBlockNumber(buffer), > offnum); > } > won't enter the branch, because HEAP_XMAX_LOCK_ONLY won't be set. Which > will leave t_ctid and HEAP_HOT_UPDATED set differently on the master and > standby / after crash recovery. I'm failing to see any harmful > consequences right now, but differences between master and standby are a bad > thing. Pre 9.3 that's not a problem, we reset ctid and HOT_UPDATED > unconditionally there. I think I'm more comfortable with setting > HEAP_XMAX_LOCK_ONLY until the tuple is finally updated - that also > coincides more closely with the actual meaning. > Just thinking out loud. If we set HEAP_XMAX_LOCK_ONLY during update, then won't it impact the return value of HeapTupleHeaderIsOnlyLocked()? It will start returning true, whereas otherwise I think it would have returned false due to the in-progress transaction. As HeapTupleHeaderIsOnlyLocked() is being used in many places, it might impact those cases. I have not checked in depth whether such an impact would cause any real issue, but it seems to me that some analysis is needed there, unless you think we are safe with respect to that. > Any arguments against? > >> >> + /* Clear only the all-frozen bit on visibility map if needed */ >> + if (PageIsAllVisible(BufferGetPage(buffer)) && >> + VM_ALL_FROZEN(relation, block, &vmbuffer)) >> + { >> + visibilitymap_clear_extended(relation, block, vmbuffer, >> + VISIBILITYMAP_ALL_FROZEN); >> + } >> + > > FWIW, I don't think it's worth introducing visibilitymap_clear_extended. > As this is a 9.6 only patch, i think it's better to change > visibilitymap_clear's API. > > Unless somebody protests I'm planning to commit with those adjustments > tomorrow. > Do you think the performance tests done by Sawada-san are sufficient to proceed here? -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On 2016-07-14 18:12:42 +0530, Amit Kapila wrote: > Just thinking out loud. If we set HEAP_XMAX_LOCK_ONLY during update, > then won't it impact the return value of > HeapTupleHeaderIsOnlyLocked(). It will start returning true whereas > otherwise I think it would have returned false due to in_progress > transaction. As HeapTupleHeaderIsOnlyLocked() is being used at many > places, it might impact those cases, I have not checked in deep > whether such an impact would cause any real issue, but it seems to me > that some analysis is needed there unless you think we are safe with > respect to that. I don't think that's an issue: right now the row will be considered deleted at that moment; with the change, it's considered locked. The latter is surely more appropriate. > > Any arguments against? > > > >> > >> + /* Clear only the all-frozen bit on visibility map if needed */ > >> + if (PageIsAllVisible(BufferGetPage(buffer)) && > >> + VM_ALL_FROZEN(relation, block, &vmbuffer)) > >> + { > >> + visibilitymap_clear_extended(relation, block, vmbuffer, > >> + VISIBILITYMAP_ALL_FROZEN); > >> + } > >> + > > > > FWIW, I don't think it's worth introducing visibilitymap_clear_extended. > > As this is a 9.6 only patch, i think it's better to change > > visibilitymap_clear's API. > > > > Unless somebody protests I'm planning to commit with those adjustments > > tomorrow. > > > > Do you think performance tests done by Sawada-san are sufficient to > proceed here? I'm doing some more, but generally yes. I also don't think we have much of a choice anyway. Greetings, Andres Freund
On 2016-07-13 23:06:07 -0700, Andres Freund wrote: > won't enter the branch, because HEAP_XMAX_LOCK_ONLY won't be set. Which > will leave t_ctid and HEAP_HOT_UPDATED set differently on the master and > standby / after crash recovery. I'm failing to see any harmful > consequences right now, but differences between master and standby are a bad > thing. I think it's actually critical, because HEAP_HOT_UPDATED / HEAP_XMAX_LOCK_ONLY are used to terminate ctid chasing loops (like heap_hot_search_buffer()). Andres
On 2016-07-14 20:53:07 -0700, Andres Freund wrote: > On 2016-07-13 23:06:07 -0700, Andres Freund wrote: > > won't enter the branch, because HEAP_XMAX_LOCK_ONLY won't be set. Which > > will leave t_ctid and HEAP_HOT_UPDATED set differently on the master and > > standby / after crash recovery. I'm failing to see any harmful > > consequences right now, but differences between master and standby are a bad > > thing. > > I think it's actually critical, because HEAP_HOT_UPDATED / > HEAP_XMAX_LOCK_ONLY are used to terminate ctid chasing loops (like > heap_hot_search_buffer()). I've pushed a quite heavily revised version of the first patch to 9.1-master. I manually verified, using pageinspect, gdb breakpoints, and a standby, that xmax, infomask, etc. are set correctly (leading to finding a4d357bf). As there are noticeable differences between versions, especially 9.2->9.3, I'd welcome somebody having a look at the commits. Regards, Andres
On 2016-07-13 23:06:07 -0700, Andres Freund wrote: > > + /* Clear only the all-frozen bit on visibility map if needed */ > > + if (PageIsAllVisible(BufferGetPage(buffer)) && > > + VM_ALL_FROZEN(relation, block, &vmbuffer)) > > + { > > + visibilitymap_clear_extended(relation, block, vmbuffer, > > + VISIBILITYMAP_ALL_FROZEN); > > + } > > + > > FWIW, I don't think it's worth introducing visibilitymap_clear_extended. > As this is a 9.6 only patch, i think it's better to change > visibilitymap_clear's API. Besides that easily fixed issue, the code also has the significant issue that it's only performing the visibilitymap processing in the BLK_NEEDS_REDO case. But that's not ok, because in both the BLK_RESTORED and the BLK_DONE cases the visibilitymap isn't guaranteed (or even likely, in the former case) to have been updated. I think we have two choices for how to deal with that: First, we can add a new flags variable to xl_heap_lock similar to xl_heap_insert/update/... and bump page magic, or we can squeeze the information into infobits_set. The latter seems fairly ugly and fragile to me; so unless somebody protests I'm going with the former. I think, due to padding, the additional byte doesn't make any size difference anyway. Regards, Andres
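A sketch of the first option (field and flag names are assumptions at this point in the thread):

    typedef struct xl_heap_lock
    {
        TransactionId locking_xid;   /* might be a MultiXactId, not an xid */
        OffsetNumber  offnum;        /* locked tuple's offset on page */
        int8          infobits_set;  /* infomask and infomask2 bits to set */
        uint8         flags;         /* new: XLH_LOCK_* flag bits */
    } xl_heap_lock;

    #define XLH_LOCK_ALL_FROZEN_CLEARED 0x01

The redo routine could then fix the visibility map before acting on XLogReadBufferForRedo()'s result, so the BLK_RESTORED and BLK_DONE cases are covered too; for example (assuming the flags-taking visibilitymap_clear() discussed above):

    if (xlrec->flags & XLH_LOCK_ALL_FROZEN_CLEARED)
    {
        RelFileNode rnode;
        Buffer      vmbuffer = InvalidBuffer;
        BlockNumber block;
        Relation    reln;

        XLogRecGetBlockTag(record, 0, &rnode, NULL, &block);
        reln = CreateFakeRelcacheEntry(rnode);

        visibilitymap_pin(reln, block, &vmbuffer);
        visibilitymap_clear(reln, block, vmbuffer, VISIBILITYMAP_ALL_FROZEN);

        ReleaseBuffer(vmbuffer);
        FreeFakeRelcacheEntry(reln);
    }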
On Sat, Jul 16, 2016 at 7:02 AM, Andres Freund <andres@anarazel.de> wrote: > On 2016-07-13 23:06:07 -0700, Andres Freund wrote: >> > + /* Clear only the all-frozen bit on visibility map if needed */ >> > + if (PageIsAllVisible(BufferGetPage(buffer)) && >> > + VM_ALL_FROZEN(relation, block, &vmbuffer)) >> > + { >> > + visibilitymap_clear_extended(relation, block, vmbuffer, >> > + VISIBILITYMAP_ALL_FROZEN); >> > + } >> > + >> >> FWIW, I don't think it's worth introducing visibilitymap_clear_extended. >> As this is a 9.6 only patch, i think it's better to change >> visibilitymap_clear's API. > > Besides that easily fixed issue, the code also has the significant issue > that it's only performing the the visibilitymap processing in the > BLK_NEEDS_REDO case. But that's not ok, because both in the BLK_RESTORED > and the BLK_DONE cases the visibilitymap isn't guaranteed (or even > likely in the former case) to have been updated. > > I think we have two choices how to deal with that: First, we can add a > new flags variable to xl_heap_lock similar to > xl_heap_insert/update/... and bump page magic, > +1 for going this way. This will keep us consistent with how we clear the visibility info in other places like heap_xlog_update(). -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Amit Kapila <amit.kapila16@gmail.com> writes: > On Sat, Jul 16, 2016 at 7:02 AM, Andres Freund <andres@anarazel.de> wrote: >> I think we have two choices how to deal with that: First, we can add a >> new flags variable to xl_heap_lock similar to >> xl_heap_insert/update/... and bump page magic, > +1 for going in this way. This will keep us consistent with how clear > the visibility info in other places like heap_xlog_update(). Yeah. We've already forced a catversion bump for beta3, and I'm about to go fix PG_CONTROL_VERSION as well, so there's basically no downside to doing an xlog version bump as well. At least, not if you can get it in before Monday. regards, tom lane
On July 16, 2016 8:49:06 AM PDT, Tom Lane <tgl@sss.pgh.pa.us> wrote: >Amit Kapila <amit.kapila16@gmail.com> writes: >> On Sat, Jul 16, 2016 at 7:02 AM, Andres Freund <andres@anarazel.de> >wrote: >>> I think we have two choices how to deal with that: First, we can add >a >>> new flags variable to xl_heap_lock similar to >>> xl_heap_insert/update/... and bump page magic, > >> +1 for going in this way. This will keep us consistent with how >clear >> the visibility info in other places like heap_xlog_update(). > >Yeah. We've already forced a catversion bump for beta3, and I'm about >to go fix PG_CONTROL_VERSION as well, so there's basically no downside >to doing an xlog version bump as well. At least, not if you can get it >in before Monday. OK, Cool. Will do it later today. Andres -- Sent from my Android device with K-9 Mail. Please excuse my brevity.
On 2016-07-16 10:45:26 -0700, Andres Freund wrote: > > > On July 16, 2016 8:49:06 AM PDT, Tom Lane <tgl@sss.pgh.pa.us> wrote: > >Amit Kapila <amit.kapila16@gmail.com> writes: > >> On Sat, Jul 16, 2016 at 7:02 AM, Andres Freund <andres@anarazel.de> > >wrote: > >>> I think we have two choices how to deal with that: First, we can add > >a > >>> new flags variable to xl_heap_lock similar to > >>> xl_heap_insert/update/... and bump page magic, > > > >> +1 for going in this way. This will keep us consistent with how > >clear > >> the visibility info in other places like heap_xlog_update(). > > > >Yeah. We've already forced a catversion bump for beta3, and I'm about > >to go fix PG_CONTROL_VERSION as well, so there's basically no downside > >to doing an xlog version bump as well. At least, not if you can get it > >in before Monday. > > OK, Cool. Will do it later today. Took till today. Attached is a rather heavily revised version of Sawada-san's patch. Most notably, the recovery routines take care to reset the vm in all cases; we don't perform visibilitymap_get_status from inside a critical section anymore; and heap_lock_updated_tuple_rec() also resets the vm (although I'm not entirely sure that can practically be hit). I'm doing some more testing, and Robert said he could take a quick look at the patch. If somebody else... Will push sometime after dinner. Regards, Andres
Attachment
On Sun, Jul 17, 2016 at 10:48 PM, Andres Freund <andres@anarazel.de> wrote: > Took till today. Attached is a rather heavily revised version of > Sawada-san's patch. Most notably the recovery routines take care to > reset the vm in all cases, we don't perform visibilitymap_get_status > from inside a critical section anymore, and > heap_lock_updated_tuple_rec() also resets the vm (although I'm not > entirely sure that can practically be hit). > > I'm doing some more testing, and Robert said he could take a quick look > at the patch. If somebody else... Will push sometime after dinner. Thanks very much for working on this. Random suggestions after a quick look: + * Before locking the buffer, pin the visibility map page if it may be + * necessary. s/necessary/needed/ More substantively, what happens if the situation changes before we obtain the buffer lock? I think you need to release the page lock, pin the page after all, and then relock the page. There seem to be several ways to escape from this function without releasing the pin on vmbuffer. From the visibilitymap_pin call here, search downward for "return". + * visibilitymap_clear - clear bit(s) for one page in visibility map I don't really like the parenthesized-s convention as a shorthand for "one or more". It tends to confuse non-native English speakers. + * any I/O. Returns whether any bits have been cleared. I suggest "Returns true if any bits have been cleared and false otherwise". -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Mon, Jul 18, 2016 at 8:18 AM, Andres Freund <andres@anarazel.de> wrote: > On 2016-07-16 10:45:26 -0700, Andres Freund wrote: >> >> >> On July 16, 2016 8:49:06 AM PDT, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> >Amit Kapila <amit.kapila16@gmail.com> writes: >> >> On Sat, Jul 16, 2016 at 7:02 AM, Andres Freund <andres@anarazel.de> >> >wrote: >> >>> I think we have two choices how to deal with that: First, we can add >> >a >> >>> new flags variable to xl_heap_lock similar to >> >>> xl_heap_insert/update/... and bump page magic, >> > >> >> +1 for going in this way. This will keep us consistent with how >> >clear >> >> the visibility info in other places like heap_xlog_update(). >> > >> >Yeah. We've already forced a catversion bump for beta3, and I'm about >> >to go fix PG_CONTROL_VERSION as well, so there's basically no downside >> >to doing an xlog version bump as well. At least, not if you can get it >> >in before Monday. >> >> OK, Cool. Will do it later today. > > Took till today. Attached is a rather heavily revised version of > Sawada-san's patch. Most notably the recovery routines take care to > reset the vm in all cases, we don't perform visibilitymap_get_status > from inside a critical section anymore, and > heap_lock_updated_tuple_rec() also resets the vm (although I'm not > entirely sure that can practically be hit). > @@ -4563,8 +4579,18 @@ heap_lock_tuple(Relation relation, HeapTuple tuple, + /* + * Before locking the buffer, pin the visibility map page if it may be + * necessary. + */ + if (PageIsAllVisible(BufferGetPage(*buffer))) + visibilitymap_pin(relation, block, &vmbuffer); + LockBuffer(*buffer, BUFFER_LOCK_EXCLUSIVE); I think we need to check for PageIsAllVisible and try to pin the visibility map after taking the lock on the buffer. I think it is quite possible that, in the time this routine takes to acquire the lock on the buffer, the page becomes all-visible. To avoid a similar hazard, we check the visibility of the page after acquiring the buffer lock in heap_update(), at the place below. if (vmbuffer == InvalidBuffer && PageIsAllVisible(page)) Similarly, I think heap_lock_updated_tuple_rec() needs to take care of the same. While I was typing this e-mail, it seems Robert has already pointed out the same issue. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On 2016-07-17 23:34:01 -0400, Robert Haas wrote: > Thanks very much for working on this. Random suggestions after a quick look: > > + * Before locking the buffer, pin the visibility map page if it may be > + * necessary. > > s/necessary/needed/ > > More substantively, what happens if the situation changes before we > obtain the buffer lock? I think you need to release the page lock, > pin the page after all, and then relock the page. It shouldn't be able to. Cleanup locks, which are required for vacuumlazy to do anything relevant, aren't possible with the buffer pinned. This pattern is used in heap_delete/heap_update, so I think we're on a reasonably well-trodden path. > There seem to be several ways to escape from this function without > releasing the pin on vmbuffer. From the visibilitymap_pin call here, > search downward for "return". Hm, that's clearly not good. The best thing to address that seems to be to create a separate jump label, which checks vmbuffer and releases the page lock. Unless you have a better idea. > + * visibilitymap_clear - clear bit(s) for one page in visibility map > > I don't really like the parenthesized-s convention as a shorthand for > "one or more". It tends to confuse non-native English speakers. > > + * any I/O. Returns whether any bits have been cleared. > > I suggest "Returns true if any bits have been cleared and false otherwise". Will change. - Andres
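One possible shape for that jump label, sketched with hypothetical label names: route every exit through a common tail that drops the content lock (if still held) and the vm pin.

    /* ...all success/failure paths jump to one of these instead of
     * returning directly... */

    out_locked:
        LockBuffer(*buffer, BUFFER_LOCK_UNLOCK);

    out_unlocked:
        if (BufferIsValid(vmbuffer))
            ReleaseBuffer(vmbuffer);

        return result;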
On 2016-07-18 09:07:19 +0530, Amit Kapila wrote: > + /* > + * Before locking the buffer, pin the visibility map page if it may be > + * necessary. > + */ > > + if (PageIsAllVisible(BufferGetPage(*buffer))) > + visibilitymap_pin(relation, block, &vmbuffer); > + > LockBuffer(*buffer, BUFFER_LOCK_EXCLUSIVE); > > I think we need to check for PageIsAllVisible and try to pin the > visibility map after taking the lock on buffer. I think it is quite > possible that in the time this routine tries to acquire lock on > buffer, the page becomes all visible. I don't see how. Without a cleanup lock it's not possible to mark a page all-visible/frozen. We might miss the bit becoming unset concurrently, but that's ok. Andres
On Mon, Jul 18, 2016 at 9:13 AM, Andres Freund <andres@anarazel.de> wrote: > On 2016-07-18 09:07:19 +0530, Amit Kapila wrote: >> + /* >> + * Before locking the buffer, pin the visibility map page if it may be >> + * necessary. >> + */ >> >> + if (PageIsAllVisible(BufferGetPage(*buffer))) >> + visibilitymap_pin(relation, block, &vmbuffer); >> + >> LockBuffer(*buffer, BUFFER_LOCK_EXCLUSIVE); >> >> I think we need to check for PageIsAllVisible and try to pin the >> visibility map after taking the lock on buffer. I think it is quite >> possible that in the time this routine tries to acquire lock on >> buffer, the page becomes all visible. > > I don't see how. Without a cleanup lock it's not possible to mark a page > all-visible/frozen. > Consider the below scenario. Vacuum a. acquires a cleanup lock on page 10 b. is busy checking the visibility of tuples --assume it takes some time here, and in the meantime Session-1 performs steps (a) and (b) and starts waiting in step (c) c. marks the page as all-visible (PageSetAllVisible) d. unlocks and releases the buffer Session-1 a. In heap_lock_tuple(), reads the buffer for page 10 b. checks PageIsAllVisible(), finds the page is not all-visible, so doesn't acquire the visibilitymap_pin c. LockBuffer in exclusive mode - here it will wait for vacuum to release the lock d. gets the lock, but now the page is marked as all-visible, so ideally it needs to recheck the page and acquire the visibilitymap_pin -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
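If that window is real, the natural defense is the recheck-after-locking dance heap_update() already does; a sketch, reusing heap_lock_tuple()'s variables:

    LockBuffer(*buffer, BUFFER_LOCK_EXCLUSIVE);

    /*
     * The page may have become all-visible while we waited for the lock.
     * If we didn't pin the vm page earlier, do it now -- dropping the
     * content lock first, since visibilitymap_pin() can perform I/O.
     */
    if (vmbuffer == InvalidBuffer && PageIsAllVisible(BufferGetPage(*buffer)))
    {
        LockBuffer(*buffer, BUFFER_LOCK_UNLOCK);
        visibilitymap_pin(relation, block, &vmbuffer);
        LockBuffer(*buffer, BUFFER_LOCK_EXCLUSIVE);
    }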
On 2016-07-18 10:02:52 +0530, Amit Kapila wrote: > On Mon, Jul 18, 2016 at 9:13 AM, Andres Freund <andres@anarazel.de> wrote: > > On 2016-07-18 09:07:19 +0530, Amit Kapila wrote: > >> + /* > >> + * Before locking the buffer, pin the visibility map page if it may be > >> + * necessary. > >> + */ > >> > >> + if (PageIsAllVisible(BufferGetPage(*buffer))) > >> + visibilitymap_pin(relation, block, &vmbuffer); > >> + > >> LockBuffer(*buffer, BUFFER_LOCK_EXCLUSIVE); > >> > >> I think we need to check for PageIsAllVisible and try to pin the > >> visibility map after taking the lock on buffer. I think it is quite > >> possible that in the time this routine tries to acquire lock on > >> buffer, the page becomes all visible. > > > > I don't see how. Without a cleanup lock it's not possible to mark a page > > all-visible/frozen. > > > > Consider the below scenario. > > > > Vacuum > > a. acquires a cleanup lock for page - 10 > > b. busy in checking visibility of tuples > > --assume, here it takes some time and in the meantime Session-1 > > performs step (a) and (b) and start waiting in step- (c) > > c. marks the page as all-visible (PageSetAllVisible) > > d. unlockandrelease the buffer > > > > Session-1 > > a. In heap_lock_tuple(), readbuffer for page-10 > > b. check PageIsAllVisible(), found page is not all-visible, so didn't > > acquire the visbilitymap_pin > > c. LockBuffer in ExlusiveMode - here it will wait for vacuum to > > release the lock > > d. Got the lock, but now the page is marked as all-visible, so ideally > > need to recheck the page and acquire the visibilitymap_pin So, I've tried pretty hard to reproduce that. While the theory above is sound, I believe the relevant code-path is essentially dead for SQL-callable code, because we'll always hold a buffer pin before even entering heap_update/heap_lock_tuple. It's possible that you could concoct a dangerous scenario with follow_updates, though; I can't immediately see how. Due to that, and with the beta release closing in, I'm planning to push a version of the patch that has the returns fixed, but not this. It seems better to have the majority of the fix in. Andres
On 2016-07-18 01:33:10 -0700, Andres Freund wrote: > On 2016-07-18 10:02:52 +0530, Amit Kapila wrote: > > On Mon, Jul 18, 2016 at 9:13 AM, Andres Freund <andres@anarazel.de> wrote: > > > On 2016-07-18 09:07:19 +0530, Amit Kapila wrote: > > >> + /* > > >> + * Before locking the buffer, pin the visibility map page if it may be > > >> + * necessary. > > >> + */ > > >> > > >> + if (PageIsAllVisible(BufferGetPage(*buffer))) > > >> + visibilitymap_pin(relation, block, &vmbuffer); > > >> + > > >> LockBuffer(*buffer, BUFFER_LOCK_EXCLUSIVE); > > >> > > >> I think we need to check for PageIsAllVisible and try to pin the > > >> visibility map after taking the lock on buffer. I think it is quite > > >> possible that in the time this routine tries to acquire lock on > > >> buffer, the page becomes all visible. > > > > > > I don't see how. Without a cleanup lock it's not possible to mark a page > > > all-visible/frozen. > > > > > > > Consider the below scenario. > > > > Vacuum > > a. acquires a cleanup lock for page - 10 > > b. busy in checking visibility of tuples > > --assume, here it takes some time and in the meantime Session-1 > > performs step (a) and (b) and start waiting in step- (c) > > c. marks the page as all-visible (PageSetAllVisible) > > d. unlockandrelease the buffer > > > > Session-1 > > a. In heap_lock_tuple(), readbuffer for page-10 > > b. check PageIsAllVisible(), found page is not all-visible, so didn't > > acquire the visbilitymap_pin > > c. LockBuffer in ExlusiveMode - here it will wait for vacuum to > > release the lock > > d. Got the lock, but now the page is marked as all-visible, so ideally > > need to recheck the page and acquire the visibilitymap_pin > > So, I've tried pretty hard to reproduce that. While the theory above is > sound, I believe the relevant code-path is essentially dead for SQL > callable code, because we'll always hold a buffer pin before even > entering heap_update/heap_lock_tuple. It's possible that you could > concoct a dangerous scenario with follow_updates though; but I can't > immediately see how. Due to that, and based on the closing in beta > release, I'm planning to push a version of the patch that the returns > fixed; but not this. It seems better to have the majority of the fix > in. Pushed that way. Let's try to figure out a good solution for a) testing this case and b) fixing it in a reasonable way. Note that there's also http://archives.postgresql.org/message-id/20160718071729.tlj4upxhaylwv75n%40alap3.anarazel.de which seems related. Regards, Andres
On Sat, Jul 16, 2016 at 10:08 AM, Andres Freund <andres@anarazel.de> wrote: > On 2016-07-14 20:53:07 -0700, Andres Freund wrote: >> On 2016-07-13 23:06:07 -0700, Andres Freund wrote: >> > won't enter the branch, because HEAP_XMAX_LOCK_ONLY won't be set. Which >> > will leave t_ctid and HEAP_HOT_UPDATED set differently on the master and >> > standby / after crash recovery. I'm failing to see any harmful >> > consequences right now, but differences between master and standby are a bad >> > thing. >> >> I think it's actually critical, because HEAP_HOT_UPDATED / >> HEAP_XMAX_LOCK_ONLY are used to terminate ctid chasing loops (like >> heap_hot_search_buffer()). > > I've pushed a quite heavily revised version of the first patch to > 9.1-master. I manually verified using pageinspect, gdb breakpoints and a > standby that xmax, infomask etc are set correctly (leading to finding > a4d357bf). As there's noticeable differences, especially 9.2->9.3, > between versions, I'd welcome somebody having a look at the commits. Waoh, man. Thanks! I was just pinged this weekend about a setup that likely has faced this exact problem, in the shape of "tuple concurrently updated" errors with a node getting kill-9-ed by some framework because it did not finish its shutdown checkpoint in time, in a test which forced it to do crash recovery. I have not been able to put my hands on the raw data to have a look at the flags set within those tuples, but I got the strong feeling that this is related to that. After a couple of rounds of that test, it was possible to see "tuple concurrently updated" errors for a relation that has few pages and a high update rate, using 9.4. More seriously, I have spent some time looking at what you have pushed on each branch, and the fixes are looking correct to me. -- Michael
On Mon, Jul 18, 2016 at 2:03 PM, Andres Freund <andres@anarazel.de> wrote: > On 2016-07-18 10:02:52 +0530, Amit Kapila wrote: >> > >> >> Consider the below scenario. >> >> Vacuum >> a. acquires a cleanup lock for page - 10 >> b. busy in checking visibility of tuples >> --assume, here it takes some time and in the meantime Session-1 >> performs step (a) and (b) and start waiting in step- (c) >> c. marks the page as all-visible (PageSetAllVisible) >> d. unlockandrelease the buffer >> >> Session-1 >> a. In heap_lock_tuple(), readbuffer for page-10 >> b. check PageIsAllVisible(), found page is not all-visible, so didn't >> acquire the visbilitymap_pin >> c. LockBuffer in ExlusiveMode - here it will wait for vacuum to >> release the lock >> d. Got the lock, but now the page is marked as all-visible, so ideally >> need to recheck the page and acquire the visibilitymap_pin > > So, I've tried pretty hard to reproduce that. While the theory above is > sound, I believe the relevant code-path is essentially dead for SQL > callable code, because we'll always hold a buffer pin before even > entering heap_update/heap_lock_tuple. > It is possible that we don't hold any buffer pin before entering heap_update() and/or heap_lock_tuple(). For heap_update(), it is possible when it enters via the simple_heap_update() path. For heap_lock_tuple(), it is possible for an ON CONFLICT DO UPDATE statement and maybe others as well. Let me also try to explain with a test for both cases, if the above is not clear enough. Case-1 for heap_update() ----------------------------------- Session-1 Create table t1(c1 int); Alter table t1 alter column c1 set default 10; --via debugger stop at StoreAttrDefault()/heap_update, while you are in heap_update(), note down the block number Session-2 vacuum (DISABLE_PAGE_SKIPPING) pg_attribute; -- In lazy_scan_heap(), stop at line (if (all_visible && !all_visible_according_to_vm)) for the block number noted in Session-1. Session-1 In the debugger, proceed and let it wait at LockBuffer; note that it will not pin the visibility map. Session-2 Set the visibility flag and complete the operation. Session-1 You will notice that it will attempt to unlock the buffer, pin the visibility map, and lock the buffer again. Case-2 for heap_lock_tuple() ---------------------------------------- Session-1 Create table i_conflict(c1 int, c2 char(100)); Create unique index idx_u on i_conflict(c1); Insert into i_conflict values(1,'aaa'); Insert into i_conflict values(1,'aaa') On Conflict (c1) Do Update Set c2='bbb'; -- via debugger, stop at line 385 in nodeModifyTable.c (In ExecInsert(), at if (onconflict == ONCONFLICT_UPDATE)). Session-2 ------------- vacuum (DISABLE_PAGE_SKIPPING) i_conflict --stop before setting the all-visible flag Session-1 -------------- In the debugger, proceed and let it wait at LockBuffer; note that it will not pin the visibility map. Session-2 --------------- Set the visibility flag and complete the operation. Session-1 -------------- PANIC: wrong buffer passed to visibilitymap_clear --this is problematic. Attached patch fixes the problem for me. Note, I have not tried to reproduce the problem for heap_lock_updated_tuple_rec(), but I think if you are convinced by the above cases, then we should have a similar check in it as well. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On Sat, Jul 23, 2016 at 3:55 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> Attached patch fixes the problem for me. Note, I have not tried to
> reproduce the problem for heap_lock_updated_tuple_rec(), but I think
> if you are convinced by the above cases, then we should have a similar
> check in it as well.

I don't think this hunk is correct:

+ /*
+  * If we didn't pin the visibility map page and the page has become
+  * all visible, we'll have to unlock and re-lock.  See heap_lock_tuple
+  * for details.
+  */
+ if (vmbuffer == InvalidBuffer && PageIsAllVisible(BufferGetPage(buf)))
+ {
+     LockBuffer(buf, BUFFER_LOCK_UNLOCK);
+     visibilitymap_pin(rel, block, &vmbuffer);
+     LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
+     goto l4;
+ }

The code beginning at label l4 assumes that the buffer is unlocked (the first thing it does is take the buffer lock), but this code jumps back to l4 with the buffer still locked. Also, I don't see the point of doing this test so far down in the function. Why not just recheck *immediately* after taking the buffer lock? If you find out that you need the pin after all, then

LockBuffer(buf, BUFFER_LOCK_UNLOCK);
visibilitymap_pin(rel, block, &vmbuffer);
LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);

but *do not* go back to l4. Unless I'm missing something, putting this block further down, as you have it, buys nothing, because none of that intervening code can release the buffer lock without using goto to jump back to l4.

+ /*
+  * If we didn't pin the visibility map page and the page has become all
+  * visible while we were busy locking the buffer, or during some
+  * subsequent window during which we had it unlocked, we'll have to unlock
+  * and re-lock, to avoid holding the buffer lock across an I/O.  That's a
+  * bit unfortunate, especially since we'll now have to recheck whether the
+  * tuple has been locked or updated under us, but hopefully it won't
+  * happen very often.
+  */
+ if (vmbuffer == InvalidBuffer && PageIsAllVisible(page))
+ {
+     LockBuffer(*buffer, BUFFER_LOCK_UNLOCK);
+     visibilitymap_pin(relation, block, &vmbuffer);
+     LockBuffer(*buffer, BUFFER_LOCK_EXCLUSIVE);
+     goto l3;
+ }

In contrast, this looks correct: l3 expects the buffer to be locked already, and the code between l3 and this point can unlock and re-lock the buffer, potentially multiple times, so jumping back to l3 after reacquiring the lock is the right thing to do.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
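For reference, a condensed sketch of the loop shape being discussed, heavily simplified from heap_lock_updated_tuple_rec() (the real code uses heap_fetch() and carries much more state; variable handling here is abbreviated for illustration). It shows why the recheck belongs immediately after the LockBuffer() at l4 and needs no goto:

#include "postgres.h"
#include "miscadmin.h"
#include "access/visibilitymap.h"
#include "storage/bufmgr.h"
#include "storage/itemptr.h"
#include "utils/rel.h"

static void
lock_updated_tuple_sketch(Relation rel, ItemPointerData tupid)
{
    Buffer      buf;
    Buffer      vmbuffer = InvalidBuffer;
    BlockNumber block;

    for (;;)
    {
        block = ItemPointerGetBlockNumber(&tupid);
        buf = ReadBuffer(rel, block);   /* pinned, but not yet locked */

l4:     /* l4 is entered with the buffer unlocked: it locks it itself */
        CHECK_FOR_INTERRUPTS();
        LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);

        /*
         * Recheck here, immediately after taking the lock.  If the VM
         * pin turns out to be needed, drop the lock, pin, re-lock, and
         * simply fall through: nothing between this point and the
         * unlock at the bottom of the loop releases the buffer lock,
         * so no goto is required -- and jumping back to l4 would try
         * to lock the already-locked buffer.
         */
        if (vmbuffer == InvalidBuffer &&
            PageIsAllVisible(BufferGetPage(buf)))
        {
            LockBuffer(buf, BUFFER_LOCK_UNLOCK);
            visibilitymap_pin(rel, block, &vmbuffer);
            LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
        }

        /* ... examine the tuple, adjust xmax/infomask, follow t_ctid ... */

        UnlockReleaseBuffer(buf);
        break;  /* sketch only; the real code loops along the ctid chain */
    }
}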
On Wed, Jul 27, 2016 at 3:24 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Sat, Jul 23, 2016 at 3:55 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> Attached patch fixes the problem for me. Note, I have not tried to
>> reproduce the problem for heap_lock_updated_tuple_rec(), but I think
>> if you are convinced by the above cases, then we should have a similar
>> check in it as well.
>
> I don't think this hunk is correct:
>
> + /*
> +  * If we didn't pin the visibility map page and the page has become
> +  * all visible, we'll have to unlock and re-lock.  See heap_lock_tuple
> +  * for details.
> +  */
> + if (vmbuffer == InvalidBuffer && PageIsAllVisible(BufferGetPage(buf)))
> + {
> +     LockBuffer(buf, BUFFER_LOCK_UNLOCK);
> +     visibilitymap_pin(rel, block, &vmbuffer);
> +     LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
> +     goto l4;
> + }
>
> The code beginning at label l4 assumes that the buffer is unlocked (the
> first thing it does is take the buffer lock), but this code jumps back to
> l4 with the buffer still locked. Also, I don't see the point of doing
> this test so far down in the function. Why not just recheck
> *immediately* after taking the buffer lock?

Right, in this case we can recheck immediately after taking the buffer lock; updated patch attached.

In passing, I have noticed that heap_delete() doesn't do this unlocking, VM pinning, and re-locking at the appropriate place: it only checks immediately after taking the lock, whereas the code further down unlocks and re-locks the buffer again without rechecking. I think we should fix that as in the attached patch (pin_vm_heap_delete-v1.patch).

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
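Stated as an invariant, what all of these call sites (heap_update(), heap_delete(), heap_lock_tuple(), heap_lock_updated_tuple_rec()) must guarantee before going on to clear the all-visible bit is roughly the following -- a hypothetical assertion for illustration, not code from either attached patch:

/*
 * Hypothetical invariant check: whenever the heap page is marked
 * all-visible at the time we are about to modify it, we must already
 * hold a pin on the corresponding VM buffer; otherwise
 * visibilitymap_clear() fails with "wrong buffer passed to
 * visibilitymap_clear".
 */
Assert(!PageIsAllVisible(page) || BufferIsValid(vmbuffer));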
On Sat, Jul 23, 2016 at 01:25:55PM +0530, Amit Kapila wrote: > On Mon, Jul 18, 2016 at 2:03 PM, Andres Freund <andres@anarazel.de> wrote: > > On 2016-07-18 10:02:52 +0530, Amit Kapila wrote: > >> Consider the below scenario. > >> > >> Vacuum > >> a. acquires a cleanup lock for page - 10 > >> b. busy in checking visibility of tuples > >> --assume, here it takes some time and in the meantime Session-1 > >> performs step (a) and (b) and start waiting in step- (c) > >> c. marks the page as all-visible (PageSetAllVisible) > >> d. unlockandrelease the buffer > >> > >> Session-1 > >> a. In heap_lock_tuple(), readbuffer for page-10 > >> b. check PageIsAllVisible(), found page is not all-visible, so didn't > >> acquire the visbilitymap_pin > >> c. LockBuffer in ExlusiveMode - here it will wait for vacuum to > >> release the lock > >> d. Got the lock, but now the page is marked as all-visible, so ideally > >> need to recheck the page and acquire the visibilitymap_pin > > > > So, I've tried pretty hard to reproduce that. While the theory above is > > sound, I believe the relevant code-path is essentially dead for SQL > > callable code, because we'll always hold a buffer pin before even > > entering heap_update/heap_lock_tuple. > > > > It is possible that we don't hold any buffer pin before entering > heap_update() and or heap_lock_tuple(). For heap_update(), it is > possible when it enters via simple_heap_update() path. For > heap_lock_tuple(), it is possible for ON CONFLICT DO Update statement > and may be others as well. This is currently listed as a 9.6 open item. Is it indeed a regression in 9.6, or do released versions have the same defect? If it is a 9.6 regression, do you happen to know which commit, or at least which feature, caused it? Thanks, nm
On Tue, Aug 2, 2016 at 11:19 AM, Noah Misch <noah@leadboat.com> wrote: > On Sat, Jul 23, 2016 at 01:25:55PM +0530, Amit Kapila wrote: >> On Mon, Jul 18, 2016 at 2:03 PM, Andres Freund <andres@anarazel.de> wrote: >> > On 2016-07-18 10:02:52 +0530, Amit Kapila wrote: >> >> Consider the below scenario. >> >> >> >> Vacuum >> >> a. acquires a cleanup lock for page - 10 >> >> b. busy in checking visibility of tuples >> >> --assume, here it takes some time and in the meantime Session-1 >> >> performs step (a) and (b) and start waiting in step- (c) >> >> c. marks the page as all-visible (PageSetAllVisible) >> >> d. unlockandrelease the buffer >> >> >> >> Session-1 >> >> a. In heap_lock_tuple(), readbuffer for page-10 >> >> b. check PageIsAllVisible(), found page is not all-visible, so didn't >> >> acquire the visbilitymap_pin >> >> c. LockBuffer in ExlusiveMode - here it will wait for vacuum to >> >> release the lock >> >> d. Got the lock, but now the page is marked as all-visible, so ideally >> >> need to recheck the page and acquire the visibilitymap_pin >> > >> > So, I've tried pretty hard to reproduce that. While the theory above is >> > sound, I believe the relevant code-path is essentially dead for SQL >> > callable code, because we'll always hold a buffer pin before even >> > entering heap_update/heap_lock_tuple. >> > >> >> It is possible that we don't hold any buffer pin before entering >> heap_update() and or heap_lock_tuple(). For heap_update(), it is >> possible when it enters via simple_heap_update() path. For >> heap_lock_tuple(), it is possible for ON CONFLICT DO Update statement >> and may be others as well. > > This is currently listed as a 9.6 open item. Is it indeed a regression in > 9.6, or do released versions have the same defect? If it is a 9.6 regression, > do you happen to know which commit, or at least which feature, caused it? > Commit eca0f1db is the reason for this specific issue. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On Tue, Aug 02, 2016 at 02:10:29PM +0530, Amit Kapila wrote: > On Tue, Aug 2, 2016 at 11:19 AM, Noah Misch <noah@leadboat.com> wrote: > > On Sat, Jul 23, 2016 at 01:25:55PM +0530, Amit Kapila wrote: > >> On Mon, Jul 18, 2016 at 2:03 PM, Andres Freund <andres@anarazel.de> wrote: > >> > On 2016-07-18 10:02:52 +0530, Amit Kapila wrote: > >> >> Consider the below scenario. > >> >> > >> >> Vacuum > >> >> a. acquires a cleanup lock for page - 10 > >> >> b. busy in checking visibility of tuples > >> >> --assume, here it takes some time and in the meantime Session-1 > >> >> performs step (a) and (b) and start waiting in step- (c) > >> >> c. marks the page as all-visible (PageSetAllVisible) > >> >> d. unlockandrelease the buffer > >> >> > >> >> Session-1 > >> >> a. In heap_lock_tuple(), readbuffer for page-10 > >> >> b. check PageIsAllVisible(), found page is not all-visible, so didn't > >> >> acquire the visbilitymap_pin > >> >> c. LockBuffer in ExlusiveMode - here it will wait for vacuum to > >> >> release the lock > >> >> d. Got the lock, but now the page is marked as all-visible, so ideally > >> >> need to recheck the page and acquire the visibilitymap_pin > >> > > >> > So, I've tried pretty hard to reproduce that. While the theory above is > >> > sound, I believe the relevant code-path is essentially dead for SQL > >> > callable code, because we'll always hold a buffer pin before even > >> > entering heap_update/heap_lock_tuple. > >> > > >> > >> It is possible that we don't hold any buffer pin before entering > >> heap_update() and or heap_lock_tuple(). For heap_update(), it is > >> possible when it enters via simple_heap_update() path. For > >> heap_lock_tuple(), it is possible for ON CONFLICT DO Update statement > >> and may be others as well. > > > > This is currently listed as a 9.6 open item. Is it indeed a regression in > > 9.6, or do released versions have the same defect? If it is a 9.6 regression, > > do you happen to know which commit, or at least which feature, caused it? > > > > Commit eca0f1db is the reason for this specific issue. [Action required within 72 hours. This is a generic notification.] The above-described topic is currently a PostgreSQL 9.6 open item. Andres, since you committed the patch believed to have created it, you own this open item. If some other commit is more relevant or if this does not belong as a 9.6 open item, please let us know. Otherwise, please observe the policy on open item ownership[1] and send a status update within 72 hours of this message. Include a date for your subsequent status update. Testers may discover new open items at any time, and I want to plan to get them all fixed in advance of shipping 9.6rc1 next week. Consequently, I will appreciate your efforts toward speedy resolution. Thanks. [1] http://www.postgresql.org/message-id/20160527025039.GA447393@tornado.leadboat.com
Hi,

On 2016-08-02 10:55:18 -0400, Noah Misch wrote:
> [Action required within 72 hours. This is a generic notification.]
>
> The above-described topic is currently a PostgreSQL 9.6 open item. Andres,
> since you committed the patch believed to have created it, you own this open
> item.

Well, kinda (it was a partial fix for something not originally by me), but I'll deal with it. Reading now, will commit tomorrow.

Regards,

Andres
On Thu, Aug 4, 2016 at 3:24 AM, Andres Freund <andres@anarazel.de> wrote: > Hi, > > On 2016-08-02 10:55:18 -0400, Noah Misch wrote: >> [Action required within 72 hours. This is a generic notification.] >> >> The above-described topic is currently a PostgreSQL 9.6 open item. Andres, >> since you committed the patch believed to have created it, you own this open >> item. > > Well kinda (it was a partial fix for something not originally by me), > but I'll deal with. Reading now, will commit tomorrow. Thanks. I kept meaning to get to this one, and failing to do so. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company