Thread: Reviewing freeze map code
Hi,

The freeze map changes, besides being very important, seem to be one of the patches with a high risk profile in 9.6. Robert had asked whether I'd take a look. I thought it'd be a good idea to review that while running tests for http://www.postgresql.org/message-id/CAMkU=1w85Dqt766AUrCnyqCXfZ+rsk1witAc_=v5+Pce93Sftw@mail.gmail.com

For starters, I'm just going through the commits. It seems the relevant pieces are:

a892234 Change the format of the VM fork to add a second bit per page.
77a1d1e Department of second thoughts: remove PD_ALL_FROZEN.
fd31cd2 Don't vacuum all-frozen pages.
7087166 pg_upgrade: Convert old visibility map format to new format.
ba0a198 Add pg_visibility contrib module.

did I miss anything important?

Greetings,

Andres Freund
Hi,

some of the review items here are mere matters of style/preference. Feel entirely free to discard them, but I thought if I'm going through the change anyway...

On 2016-05-02 14:48:18 -0700, Andres Freund wrote:
> a892234 Change the format of the VM fork to add a second bit per page.

TL;DR: fairly minor stuff.

+ * heap_tuple_needs_eventual_freeze
+ *
+ * Check to see whether any of the XID fields of a tuple (xmin, xmax, xvac)
+ * will eventually require freezing. Similar to heap_tuple_needs_freeze,
+ * but there's no cutoff, since we're trying to figure out whether freezing
+ * will ever be needed, not whether it's needed now.
+ */
+bool
+heap_tuple_needs_eventual_freeze(HeapTupleHeader tuple)

Wouldn't redefining this to heap_tuple_is_frozen() and then inverting the checks be easier to understand?

+ /*
+  * If xmax is a valid xact or multixact, this tuple is also not frozen.
+  */
+ if (tuple->t_infomask & HEAP_XMAX_IS_MULTI)
+ {
+     MultiXactId multi;
+
+     multi = HeapTupleHeaderGetRawXmax(tuple);
+     if (MultiXactIdIsValid(multi))
+         return true;
+ }

Hm. What's the test inside the if() for? There shouldn't be any case where xmax is invalid if HEAP_XMAX_IS_MULTI is set. Now there's a check like that outside of this commit, but it seems strange to me (Alvaro, perhaps you could comment on this?).

+ *
+ * Clearing both visibility map bits is not separately WAL-logged. The callers
 * must make sure that whenever a bit is cleared, the bit is cleared on WAL
 * replay of the updating operation as well.

I think including "both" here makes things less clear, because it differentiates clearing one bit from clearing both. There's no practical difference atm, but still.

 *
 * VACUUM will normally skip pages for which the visibility map bit is set;
 * such pages can't contain any dead tuples and therefore don't need vacuuming.
- * The visibility map is not used for anti-wraparound vacuums, because
- * an anti-wraparound vacuum needs to freeze tuples and observe the latest xid
- * present in the table, even on pages that don't have any dead tuples.
 *

I think the remaining sentence isn't entirely accurate: there's now more than one bit, and they're different with regard to scan_all/!scan_all vacuums (or will be - maybe this is updated further in a later commit? But if so, that sentence shouldn't yet be removed...).

-
-/* Number of heap blocks we can represent in one byte. */
-#define HEAPBLOCKS_PER_BYTE 8
-

Hm, why was this moved to the header? Sounds like something the outside shouldn't care about.

#define HEAPBLK_TO_MAPBIT(x) (((x) % HEAPBLOCKS_PER_BYTE) * BITS_PER_HEAPBLOCK)

Hm. This isn't really a mapping to an individual bit anymore - but I don't really have a better name in mind. Maybe TO_OFFSET?

+static const uint8 number_of_ones_for_visible[256] = {
...
+};
+static const uint8 number_of_ones_for_frozen[256] = {
...
+};

Did somebody verify the new contents are correct?

 /*
- * visibilitymap_clear - clear a bit in visibility map
+ * visibilitymap_clear - clear all bits in visibility map
 *

This seems rather easy to misunderstand, as this really only clears all the bits for one page, not actually all the bits.

 * the bit for heapBlk, or InvalidBuffer. The caller is responsible for
- * releasing *buf after it's done testing and setting bits.
+ * releasing *buf after it's done testing and setting bits, and must pass flags
+ * for which it needs to check the value in visibility map.
 *
 * NOTE: This function is typically called without a lock on the heap page,
 * so somebody else could change the bit just after we look at it. In fact,
@@ -327,17 +351,16 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,

I'm not seeing what flags the above comment change is referring to?

 /*
- * A single-bit read is atomic. There could be memory-ordering effects
+ * A single byte read is atomic. There could be memory-ordering effects
 * here, but for performance reasons we make it the caller's job to worry
 * about that.
 */
- result = (map[mapByte] & (1 << mapBit)) ? true : false;
-
- return result;
+ return ((map[mapByte] >> mapBit) & VISIBILITYMAP_VALID_BITS);
}

Not a new issue, and *very* likely to be irrelevant in practice (given the value is only referenced once): But there's really no guarantee map[mapByte] is only read once here.

-BlockNumber
-visibilitymap_count(Relation rel)
+void
+visibilitymap_count(Relation rel, BlockNumber *all_visible, BlockNumber *all_frozen)

Not really a new issue again: The parameter types (previously return type) to this function seem wrong to me.

@@ -1934,5 +1992,14 @@ heap_page_is_all_visible(Relation rel, Buffer buf, TransactionId *visibility_cut
 }

+ /*
+  * We don't bother clearing *all_frozen when the page is discovered not
+  * to be all-visible, so do that now if necessary. The page might fail
+  * to be all-frozen for other reasons anyway, but if it's not all-visible,
+  * then it definitely isn't all-frozen.
+  */
+ if (!all_visible)
+     *all_frozen = false;
+

Why don't we just set *all_frozen to false when appropriate? It'd be just as many lines and probably easier to understand?

+ /*
+  * If the page is marked as all-visible but not all-frozen, we should
+  * so mark it. Note that all_frozen is only valid if all_visible is
+  * true, so we must check both.
+  */

This kinda seems to imply that all-visible implies all_frozen. Also, why has that block been added to the end of the if/else if chain? Seems like it belongs below the (all_visible && !all_visible_according_to_vm) block.

Greetings,

Andres Freund
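[For readers following the two-bit layout being reviewed above, here is a minimal, self-contained sketch - ours, not the committed code, and it collapses the per-VM-page addressing layer (HEAPBLK_TO_MAPBLOCK) that the real macros also have - showing why HEAPBLK_TO_MAPBIT now yields the offset of a heap block's two-bit group rather than a single bit position:

#include <stdio.h>
#include <stdint.h>

#define BITS_PER_HEAPBLOCK          2
#define HEAPBLOCKS_PER_BYTE         (8 / BITS_PER_HEAPBLOCK)    /* now 4, not 8 */

#define VISIBILITYMAP_ALL_VISIBLE   0x01
#define VISIBILITYMAP_ALL_FROZEN    0x02
#define VISIBILITYMAP_VALID_BITS    0x03

/* simplified: the real macros also divide the map into BLCKSZ-sized pages */
#define HEAPBLK_TO_MAPBYTE(x)       ((x) / HEAPBLOCKS_PER_BYTE)
#define HEAPBLK_TO_MAPBIT(x)        (((x) % HEAPBLOCKS_PER_BYTE) * BITS_PER_HEAPBLOCK)

int
main(void)
{
    uint8_t     map[2] = {0, 0};    /* a tiny toy visibility map */
    uint32_t    blk = 5;

    /* mark heap block 5 both all-visible and all-frozen */
    map[HEAPBLK_TO_MAPBYTE(blk)] |=
        VISIBILITYMAP_VALID_BITS << HEAPBLK_TO_MAPBIT(blk);

    /* extract both flag bits, as the reworked read path does */
    printf("block %u status = 0x%x\n", (unsigned) blk,
           (map[HEAPBLK_TO_MAPBYTE(blk)] >> HEAPBLK_TO_MAPBIT(blk))
           & VISIBILITYMAP_VALID_BITS);
    return 0;
}

With BITS_PER_HEAPBLOCK = 2 a map byte covers four heap blocks, the all-visible flags sitting at even bit positions and the all-frozen flags at odd ones - which is also why the number_of_ones_for_visible/frozen tables questioned above have to count the two kinds of set bits separately.]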
On Tue, May 3, 2016 at 6:48 AM, Andres Freund <andres@anarazel.de> wrote:
> Hi,
>
> The freeze map changes, besides being very important, seem to be one of
> the patches with a high risk profile in 9.6. Robert had asked whether
> I'd take a look. I thought it'd be a good idea to review that while
> running tests for
> http://www.postgresql.org/message-id/CAMkU=1w85Dqt766AUrCnyqCXfZ+rsk1witAc_=v5+Pce93Sftw@mail.gmail.com

Thank you for reviewing.

> For starters, I'm just going through the commits. It seems the relevant
> pieces are:
>
> a892234 Change the format of the VM fork to add a second bit per page.
> 77a1d1e Department of second thoughts: remove PD_ALL_FROZEN.
> fd31cd2 Don't vacuum all-frozen pages.
> 7087166 pg_upgrade: Convert old visibility map format to new format.
> ba0a198 Add pg_visibility contrib module.
>
> did I miss anything important?

That's all.

Regards,

--
Masahiko Sawada
On 2016-05-02 14:48:18 -0700, Andres Freund wrote:
> 77a1d1e Department of second thoughts: remove PD_ALL_FROZEN.

Nothing to say here.

> fd31cd2 Don't vacuum all-frozen pages.

Hm. I do wonder if it's going to bite us that we don't have a way to actually force vacuuming of the whole table (besides manually rm'ing the VM). I've more than once seen VACUUM used to try to do some integrity checking of the database. How are we actually going to test that the feature works correctly? They'd have to write checks on top of pg_visibility to see whether things are borked.

 /*
  * Compute whether we actually scanned the whole relation. If we did, we
  * can adjust relfrozenxid and relminmxid.
  *
  * NB: We need to check this before truncating the relation, because that
  * will change ->rel_pages.
  */

Comment is out-of-date now.

- if (blkno == next_not_all_visible_block)
+ if (blkno == next_unskippable_block)
  {
-     /* Time to advance next_not_all_visible_block */
-     for (next_not_all_visible_block++;
-          next_not_all_visible_block < nblocks;
-          next_not_all_visible_block++)
+     /* Time to advance next_unskippable_block */
+     for (next_unskippable_block++;
+          next_unskippable_block < nblocks;
+          next_unskippable_block++)

Hm. So we continue with the course of re-processing pages, even if they're marked all-frozen. For all-visible there at least can be a benefit by freezing earlier, but for all-frozen pages there's really no point. I don't really buy the arguments for the skipping logic. But even disregarding that, maybe we should skip processing a block if it's all-frozen (without preventing the page from being read?); as there's no possible benefit? Acquiring the exclusive/content lock and stuff is far from free.

Not really related to this patch, but the FORCE_CHECK_PAGE is rather ugly.

+ /*
+  * The current block is potentially skippable; if we've seen a
+  * long enough run of skippable blocks to justify skipping it, and
+  * we're not forced to check it, then go ahead and skip.
+  * Otherwise, the page must be at least all-visible if not
+  * all-frozen, so we can set all_visible_according_to_vm = true.
+  */
+ if (skipping_blocks && !FORCE_CHECK_PAGE())
+ {
+     /*
+      * Tricky, tricky. If this is in aggressive vacuum, the page
+      * must have been all-frozen at the time we checked whether it
+      * was skippable, but it might not be any more. We must be
+      * careful to count it as a skipped all-frozen page in that
+      * case, or else we'll think we can't update relfrozenxid and
+      * relminmxid. If it's not an aggressive vacuum, we don't
+      * know whether it was all-frozen, so we have to recheck; but
+      * in this case an approximate answer is OK.
+      */
+     if (aggressive || VM_ALL_FROZEN(onerel, blkno, &vmbuffer))
+         vacrelstats->frozenskipped_pages++;
      continue;
+ }

Hm. This indeed seems a bit tricky. Not sure how to make it easier though without just ripping out the SKIP_PAGES_THRESHOLD stuff.

Hm. This also doubles the number of VM accesses. While I guess that's not noticeable most of the time, it's still not nice; especially when a large relation is entirely frozen, because it'll mean we'll sequentially go through the visibility map twice.

I wondered for a minute whether #14057 could cause really bad issues here
http://www.postgresql.org/message-id/20160331103739.8956.94469@wrigleys.postgresql.org
but I don't see it being more relevant here.

Andres
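[To make the control flow under discussion easier to follow, here is a toy, self-contained simulation of the skip logic - ours, heavily simplified from lazy_scan_heap; SKIP_PAGES_THRESHOLD is defined as 32 in vacuumlazy.c. Runs of skippable pages shorter than the threshold are still read, and all-frozen pages skipped during an aggressive vacuum are remembered in frozenskipped_pages so relfrozenxid can still be advanced:

#include <stdbool.h>
#include <stdio.h>

#define SKIP_PAGES_THRESHOLD    32
#define NBLOCKS                 100

static bool vm_all_visible[NBLOCKS];    /* toy stand-in for VM_ALL_VISIBLE() */
static bool vm_all_frozen[NBLOCKS];     /* toy stand-in for VM_ALL_FROZEN() */

int
main(void)
{
    int     next_unskippable_block = 0;
    bool    skipping_blocks = false;
    bool    aggressive = false;         /* non-aggressive vacuum */
    int     frozenskipped_pages = 0;

    /* pretend blocks 10..79 are all-visible and all-frozen */
    for (int i = 10; i < 80; i++)
        vm_all_visible[i] = vm_all_frozen[i] = true;

    for (int blkno = 0; blkno < NBLOCKS; blkno++)
    {
        if (blkno == next_unskippable_block)
        {
            /* advance past the upcoming run of skippable blocks */
            for (next_unskippable_block++;
                 next_unskippable_block < NBLOCKS;
                 next_unskippable_block++)
            {
                if (aggressive
                    ? !vm_all_frozen[next_unskippable_block]
                    : !vm_all_visible[next_unskippable_block])
                    break;
            }
            /* only skip runs long enough to justify it */
            skipping_blocks =
                (next_unskippable_block - blkno) > SKIP_PAGES_THRESHOLD;
        }
        else if (skipping_blocks)
        {
            /* count definitely-frozen skips toward relfrozenxid safety */
            if (aggressive || vm_all_frozen[blkno])
                frozenskipped_pages++;
            continue;
        }
        /* ... a real vacuum would read and process the page here ... */
    }

    printf("skipped %d all-frozen pages\n", frozenskipped_pages);
    return 0;
}

Running it reports 70 skipped all-frozen pages: the 70-block frozen run clears the threshold, while the short runs at either end are still read, which is exactly the re-processing Andres is questioning.]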
Hi,

On 2016-05-02 14:48:18 -0700, Andres Freund wrote:
> 7087166 pg_upgrade: Convert old visibility map format to new format.

+const char *
+rewriteVisibilityMap(const char *fromfile, const char *tofile, bool force)
...
+ while ((bytesRead = read(src_fd, buffer, BLCKSZ)) == BLCKSZ)
+ {
..

Uh, shouldn't we actually fail if we read incompletely? Rather than silently ignoring the problem? Ok, this causes no corruption, but it indicates that something went significantly wrong.

+ char new_vmbuf[BLCKSZ];
+ char *new_cur = new_vmbuf;
+ bool empty = true;
+ bool old_lastpart;
+
+ /* Copy page header in advance */
+ memcpy(new_vmbuf, &pageheader, SizeOfPageHeaderData);

Shouldn't we zero out new_vmbuf? Afaics we're not necessarily zeroing it with old_lastpart && !empty, right?

+ if ((dst_fd = open(tofile, O_RDWR | O_CREAT | (force ? 0 : O_EXCL), S_IRUSR | S_IWUSR)) < 0)
+ {
+     close(src_fd);
+     return getErrorText();
+ }

I know you guys copied this, but what's the force thing about? Especially as it's always set to true by the callers (i.e. what is the parameter even about?)? Wouldn't we at least have to specify O_TRUNC in the force case?

+ old_cur += BITS_PER_HEAPBLOCK_OLD;
+ new_cur += BITS_PER_HEAPBLOCK;

I'm not sure I'm understanding the point of the BITS_PER_HEAPBLOCK_OLD stuff - as long as it's hardcoded into rewriteVisibilityMap() we'll not be able to have differing ones anyway, should we decide to add a third bit?

- Andres
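[The BITS_PER_HEAPBLOCK_OLD question is easier to see with the conversion written out. A hedged, self-contained sketch of the idea - ours; the helper name and structure differ from the actual rewriteVisibilityMap(): each old byte, eight blocks at one bit apiece, fans out into two new bytes of four two-bit slots, carrying over only the all-visible bit and leaving all-frozen clear, since the old format carries no frozen information:

#include <stdint.h>
#include <stdio.h>

#define BITS_PER_HEAPBLOCK_OLD      1   /* 9.5 and earlier: all-visible only */
#define BITS_PER_HEAPBLOCK          2   /* 9.6: all-visible + all-frozen */
#define VISIBILITYMAP_ALL_VISIBLE   0x01

/* Expand one old-format VM byte (8 heap blocks) into two new-format
 * bytes (4 heap blocks each). The new all-frozen bits start clear. */
static void
expand_vm_byte(uint8_t old_byte, uint8_t out[2])
{
    out[0] = out[1] = 0;
    for (int blk = 0; blk < 8; blk++)
    {
        if (old_byte & (1 << (blk * BITS_PER_HEAPBLOCK_OLD)))
            out[blk / 4] |= VISIBILITYMAP_ALL_VISIBLE
                << ((blk % 4) * BITS_PER_HEAPBLOCK);
    }
}

int
main(void)
{
    uint8_t out[2];

    expand_vm_byte(0xFF, out);          /* all eight blocks all-visible */
    printf("0xFF -> 0x%02X 0x%02X\n", out[0], out[1]);  /* 0x55 0x55 */
    return 0;
}

Written this way, Andres's point stands out: the expansion ratio is baked into the loop shape, so the _OLD constant alone wouldn't let a third bit be added later.]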
On Mon, May 2, 2016 at 8:25 PM, Andres Freund <andres@anarazel.de> wrote:
> + * heap_tuple_needs_eventual_freeze
> + *
> + * Check to see whether any of the XID fields of a tuple (xmin, xmax, xvac)
> + * will eventually require freezing. Similar to heap_tuple_needs_freeze,
> + * but there's no cutoff, since we're trying to figure out whether freezing
> + * will ever be needed, not whether it's needed now.
> + */
> +bool
> +heap_tuple_needs_eventual_freeze(HeapTupleHeader tuple)
>
> Wouldn't redefining this to heap_tuple_is_frozen() and then inverting the
> checks be easier to understand?

I thought it much safer to keep this as close to a copy of heap_tuple_needs_freeze() as possible. Copying a function and inverting all of the return values is much more likely to introduce bugs, IME.

> + /*
> +  * If xmax is a valid xact or multixact, this tuple is also not frozen.
> +  */
> + if (tuple->t_infomask & HEAP_XMAX_IS_MULTI)
> + {
> +     MultiXactId multi;
> +
> +     multi = HeapTupleHeaderGetRawXmax(tuple);
> +     if (MultiXactIdIsValid(multi))
> +         return true;
> + }
>
> Hm. What's the test inside the if() for? There shouldn't be any case
> where xmax is invalid if HEAP_XMAX_IS_MULTI is set. Now there's a
> check like that outside of this commit, but it seems strange to me
> (Alvaro, perhaps you could comment on this?).

Here again I was copying existing code, with appropriate simplifications.

> + *
> + * Clearing both visibility map bits is not separately WAL-logged. The callers
> * must make sure that whenever a bit is cleared, the bit is cleared on WAL
> * replay of the updating operation as well.
>
> I think including "both" here makes things less clear, because it
> differentiates clearing one bit from clearing both. There's no practical
> difference atm, but still.

I agree.

> *
> * VACUUM will normally skip pages for which the visibility map bit is set;
> * such pages can't contain any dead tuples and therefore don't need vacuuming.
> - * The visibility map is not used for anti-wraparound vacuums, because
> - * an anti-wraparound vacuum needs to freeze tuples and observe the latest xid
> - * present in the table, even on pages that don't have any dead tuples.
> *
>
> I think the remaining sentence isn't entirely accurate, there's now more
> than one bit, and they're different with regard to scan_all/!scan_all
> vacuums (or will be - maybe this is updated further in a later commit? But
> if so, that sentence shouldn't yet be removed...).

We can adjust the language, but I don't really see a big problem here.

> -/* Number of heap blocks we can represent in one byte. */
> -#define HEAPBLOCKS_PER_BYTE 8
> -
> Hm, why was this moved to the header? Sounds like something the outside
> shouldn't care about.

Oh... yeah. Let's undo that.

> #define HEAPBLK_TO_MAPBIT(x) (((x) % HEAPBLOCKS_PER_BYTE) * BITS_PER_HEAPBLOCK)
>
> Hm. This isn't really a mapping to an individual bit anymore - but I
> don't really have a better name in mind. Maybe TO_OFFSET?

Well, it sorta is... but we could change it, I suppose.

> +static const uint8 number_of_ones_for_visible[256] = {
> ...
> +};
> +static const uint8 number_of_ones_for_frozen[256] = {
> ...
> +};
>
> Did somebody verify the new contents are correct?

I admit that I didn't. It seemed like an unlikely place for a goof, but I guess we should verify.
> /*
> - * visibilitymap_clear - clear a bit in visibility map
> + * visibilitymap_clear - clear all bits in visibility map
> *
>
> This seems rather easy to misunderstand, as this really only clears all
> the bits for one page, not actually all the bits.

We could change "in" to "for one page in the".

> * the bit for heapBlk, or InvalidBuffer. The caller is responsible for
> - * releasing *buf after it's done testing and setting bits.
> + * releasing *buf after it's done testing and setting bits, and must pass flags
> + * for which it needs to check the value in visibility map.
> *
> * NOTE: This function is typically called without a lock on the heap page,
> * so somebody else could change the bit just after we look at it. In fact,
> @@ -327,17 +351,16 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
>
> I'm not seeing what flags the above comment change is referring to?

Ugh. I think that's leftover cruft from an earlier patch version that should have been excised from what got committed.

> /*
> - * A single-bit read is atomic. There could be memory-ordering effects
> + * A single byte read is atomic. There could be memory-ordering effects
> * here, but for performance reasons we make it the caller's job to worry
> * about that.
> */
> - result = (map[mapByte] & (1 << mapBit)) ? true : false;
> -
> - return result;
> + return ((map[mapByte] >> mapBit) & VISIBILITYMAP_VALID_BITS);
> }
>
> Not a new issue, and *very* likely to be irrelevant in practice (given
> the value is only referenced once): But there's really no guarantee
> map[mapByte] is only read once here.

Meh. But we can fix if you want to.

> -BlockNumber
> -visibilitymap_count(Relation rel)
> +void
> +visibilitymap_count(Relation rel, BlockNumber *all_visible, BlockNumber *all_frozen)
>
> Not really a new issue again: The parameter types (previously return
> type) to this function seem wrong to me.

Not this patch's job to tinker.

> @@ -1934,5 +1992,14 @@ heap_page_is_all_visible(Relation rel, Buffer buf, TransactionId *visibility_cut
> }
> + /*
> +  * We don't bother clearing *all_frozen when the page is discovered not
> +  * to be all-visible, so do that now if necessary. The page might fail
> +  * to be all-frozen for other reasons anyway, but if it's not all-visible,
> +  * then it definitely isn't all-frozen.
> +  */
> + if (!all_visible)
> +     *all_frozen = false;
> +
>
> Why don't we just set *all_frozen to false when appropriate? It'd be
> just as many lines and probably easier to understand?

I thought that looked really easy to mess up, either now or down the road. This way seemed more solid to me. That's a judgement call, of course.

> + /*
> +  * If the page is marked as all-visible but not all-frozen, we should
> +  * so mark it. Note that all_frozen is only valid if all_visible is
> +  * true, so we must check both.
> +  */
>
> This kinda seems to imply that all-visible implies all_frozen. Also, why
> has that block been added to the end of the if/else if chain? Seems like
> it belongs below the (all_visible && !all_visible_according_to_vm) block.

We can adjust the comment a bit to make it more clear, if you like, but I doubt it's going to cause serious misunderstanding. As for the placement, the reason I put it at the end is because I figured that we did not want to mark it all-frozen if any of the "oh crap, emit a warning" cases applied.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
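[For what the "read once" fix would look like, a minimal sketch - ours, illustrative rather than the committed function - that copies the shared byte into a local before extracting the flag bits:

#include <stdint.h>

#define VISIBILITYMAP_VALID_BITS    0x03

/* Read a heap block's two VM flag bits with exactly one load of the
 * shared map byte; the local copy keeps the compiler from reloading
 * map[mapByte] between uses. */
static uint8_t
vm_get_status(const volatile uint8_t *map, uint32_t mapByte, int mapBit)
{
    uint8_t mapbits = map[mapByte];     /* the single shared-memory read */

    return (mapbits >> mapBit) & VISIBILITYMAP_VALID_BITS;
}

int
main(void)
{
    static const uint8_t map[1] = {0x0C};   /* block 1: all-visible + all-frozen */

    return vm_get_status(map, 0, 2) == VISIBILITYMAP_VALID_BITS ? 0 : 1;
}
]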
On Wed, May 4, 2016 at 8:08 PM, Andres Freund <andres@anarazel.de> wrote:
> On 2016-05-02 14:48:18 -0700, Andres Freund wrote:
>> 77a1d1e Department of second thoughts: remove PD_ALL_FROZEN.
>
> Nothing to say here.
>
>> fd31cd2 Don't vacuum all-frozen pages.
>
> Hm. I do wonder if it's going to bite us that we don't have a way to
> actually force vacuuming of the whole table (besides manually rm'ing the
> VM). I've more than once seen VACUUM used to try to do some integrity
> checking of the database. How are we actually going to test that the
> feature works correctly? They'd have to write checks on top of
> pg_visibility to see whether things are borked.

Let's add VACUUM (FORCE) or something like that.

> /*
>  * Compute whether we actually scanned the whole relation. If we did, we
>  * can adjust relfrozenxid and relminmxid.
>  *
>  * NB: We need to check this before truncating the relation, because that
>  * will change ->rel_pages.
>  */
>
> Comment is out-of-date now.

OK.

> - if (blkno == next_not_all_visible_block)
> + if (blkno == next_unskippable_block)
>   {
> -     /* Time to advance next_not_all_visible_block */
> -     for (next_not_all_visible_block++;
> -          next_not_all_visible_block < nblocks;
> -          next_not_all_visible_block++)
> +     /* Time to advance next_unskippable_block */
> +     for (next_unskippable_block++;
> +          next_unskippable_block < nblocks;
> +          next_unskippable_block++)
>
> Hm. So we continue with the course of re-processing pages, even if
> they're marked all-frozen. For all-visible there at least can be a
> benefit by freezing earlier, but for all-frozen pages there's really no
> point. I don't really buy the arguments for the skipping logic. But
> even disregarding that, maybe we should skip processing a block if it's
> all-frozen (without preventing the page from being read?); as there's no
> possible benefit? Acquiring the exclusive/content lock and stuff is far
> from free.

I wanted to tinker with this logic as little as possible in the interest of ending up with something that worked. I would not have written it this way.

> Not really related to this patch, but the FORCE_CHECK_PAGE is rather
> ugly.

+1.

> + /*
> +  * The current block is potentially skippable; if we've seen a
> +  * long enough run of skippable blocks to justify skipping it, and
> +  * we're not forced to check it, then go ahead and skip.
> +  * Otherwise, the page must be at least all-visible if not
> +  * all-frozen, so we can set all_visible_according_to_vm = true.
> +  */
> + if (skipping_blocks && !FORCE_CHECK_PAGE())
> + {
> +     /*
> +      * Tricky, tricky. If this is in aggressive vacuum, the page
> +      * must have been all-frozen at the time we checked whether it
> +      * was skippable, but it might not be any more. We must be
> +      * careful to count it as a skipped all-frozen page in that
> +      * case, or else we'll think we can't update relfrozenxid and
> +      * relminmxid. If it's not an aggressive vacuum, we don't
> +      * know whether it was all-frozen, so we have to recheck; but
> +      * in this case an approximate answer is OK.
> +      */
> +     if (aggressive || VM_ALL_FROZEN(onerel, blkno, &vmbuffer))
> +         vacrelstats->frozenskipped_pages++;
>       continue;
> + }
>
> Hm. This indeed seems a bit tricky. Not sure how to make it easier
> though without just ripping out the SKIP_PAGES_THRESHOLD stuff.

Yep, I had the same problem.

> Hm. This also doubles the number of VM accesses. While I guess that's
> not noticeable most of the time, it's still not nice; especially when a
> large relation is entirely frozen, because it'll mean we'll sequentially
> go through the visibility map twice.

Compared to what we're saving, that's obviously a trivial cost. That's not to say that we might not want to improve it, but it's hardly a disaster. In short: wah, wah, wah.

> I wondered for a minute whether #14057 could cause really bad issues
> here
> http://www.postgresql.org/message-id/20160331103739.8956.94469@wrigleys.postgresql.org
> but I don't see it being more relevant here.

I don't really understand what the concern is here, but if it's not a problem, let's not spend time trying to clarify.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Thu, May 5, 2016 at 2:20 PM, Andres Freund <andres@anarazel.de> wrote:
> On 2016-05-02 14:48:18 -0700, Andres Freund wrote:
>> 7087166 pg_upgrade: Convert old visibility map format to new format.
>
> +const char *
> +rewriteVisibilityMap(const char *fromfile, const char *tofile, bool force)
> ...
> + while ((bytesRead = read(src_fd, buffer, BLCKSZ)) == BLCKSZ)
> + {
> ..
>
> Uh, shouldn't we actually fail if we read incompletely? Rather than
> silently ignoring the problem? Ok, this causes no corruption, but it
> indicates that something went significantly wrong.

Sure, that's reasonable.

> + char new_vmbuf[BLCKSZ];
> + char *new_cur = new_vmbuf;
> + bool empty = true;
> + bool old_lastpart;
> +
> + /* Copy page header in advance */
> + memcpy(new_vmbuf, &pageheader, SizeOfPageHeaderData);
>
> Shouldn't we zero out new_vmbuf? Afaics we're not necessarily zeroing it
> with old_lastpart && !empty, right?

Oh, dear. That seems like a possible data corruption bug. Maybe we'd better fix that right away (although I don't actually have time before the wrap).

> + if ((dst_fd = open(tofile, O_RDWR | O_CREAT | (force ? 0 : O_EXCL), S_IRUSR | S_IWUSR)) < 0)
> + {
> +     close(src_fd);
> +     return getErrorText();
> + }
>
> I know you guys copied this, but what's the force thing about?
> Especially as it's always set to true by the callers (i.e. what is the
> parameter even about?)? Wouldn't we at least have to specify O_TRUNC in
> the force case?

I just work here.

> + old_cur += BITS_PER_HEAPBLOCK_OLD;
> + new_cur += BITS_PER_HEAPBLOCK;
>
> I'm not sure I'm understanding the point of the BITS_PER_HEAPBLOCK_OLD
> stuff - as long as it's hardcoded into rewriteVisibilityMap() we'll not
> be able to have differing ones anyway, should we decide to add a third
> bit?

I think that's just a matter of style.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
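[The zeroing fix being contemplated is small. A hedged sketch - ours; the type and constant stand-ins below are illustrative, not pg_upgrade's actual declarations - of starting each new VM page from all zeroes before copying the header in:

#include <string.h>

#define BLCKSZ 8192

/* illustrative stand-in for the real page header machinery */
typedef struct { char data[24]; } PageHeaderData;
#define SizeOfPageHeaderData sizeof(PageHeaderData)

static void
init_new_vm_page(char *new_vmbuf, const PageHeaderData *pageheader)
{
    /*
     * Zero the whole buffer first, so the unused tail of a partially
     * filled last VM page cannot carry stack garbage into the new
     * cluster; then copy the page header in advance, as before.
     */
    memset(new_vmbuf, 0, BLCKSZ);
    memcpy(new_vmbuf, pageheader, SizeOfPageHeaderData);
}

int
main(void)
{
    static PageHeaderData   hdr;
    char                    new_vmbuf[BLCKSZ];

    init_new_vm_page(new_vmbuf, &hdr);
    return 0;
}
]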
On 05/06/2016 01:40 PM, Robert Haas wrote:
> On Wed, May 4, 2016 at 8:08 PM, Andres Freund <andres@anarazel.de> wrote:
>> On 2016-05-02 14:48:18 -0700, Andres Freund wrote:
>>> 77a1d1e Department of second thoughts: remove PD_ALL_FROZEN.
>>
>> Nothing to say here.
>>
>>> fd31cd2 Don't vacuum all-frozen pages.
>>
>> Hm. I do wonder if it's going to bite us that we don't have a way to
>> actually force vacuuming of the whole table (besides manually rm'ing the
>> VM). I've more than once seen VACUUM used to try to do some integrity
>> checking of the database. How are we actually going to test that the
>> feature works correctly? They'd have to write checks on top of
>> pg_visibility to see whether things are borked.
>
> Let's add VACUUM (FORCE) or something like that.

This is actually inverted. Vacuum by default should vacuum the entire relation; however, if we are going to keep the existing behavior of this patch, VACUUM (FROZEN) seems to be better than (FORCE)?

Sincerely,

JD

--
Command Prompt, Inc. http://the.postgres.company/ +1-503-667-4564
PostgreSQL Centered full stack support, consulting and development.
Everyone appreciates your honesty, until you are honest with them.
On 2016-05-06 13:48:09 -0700, Joshua D. Drake wrote:
> On 05/06/2016 01:40 PM, Robert Haas wrote:
> > On Wed, May 4, 2016 at 8:08 PM, Andres Freund <andres@anarazel.de> wrote:
> > > On 2016-05-02 14:48:18 -0700, Andres Freund wrote:
> > > > 77a1d1e Department of second thoughts: remove PD_ALL_FROZEN.
> > >
> > > Nothing to say here.
> > >
> > > > fd31cd2 Don't vacuum all-frozen pages.
> > >
> > > Hm. I do wonder if it's going to bite us that we don't have a way to
> > > actually force vacuuming of the whole table (besides manually rm'ing the
> > > VM). I've more than once seen VACUUM used to try to do some integrity
> > > checking of the database. How are we actually going to test that the
> > > feature works correctly? They'd have to write checks on top of
> > > pg_visibility to see whether things are borked.
> >
> > Let's add VACUUM (FORCE) or something like that.

Yes, that makes sense.

> This is actually inverted. Vacuum by default should vacuum the entire
> relation

What? Why on earth would that be a good idea? Not to speak of the fact that that's not been the case since ~8.4?

> ; however, if we are going to keep the existing behavior of this
> patch, VACUUM (FROZEN) seems to be better than (FORCE)?

There already is FREEZE - meaning something different - so I doubt it.

Andres
On 05/06/2016 01:50 PM, Andres Freund wrote: >>> Let's add VACUUM (FORCE) or something like that. > > Yes, that makes sense. > > >> This is actually inverted. Vacuum by default should vacuum the entire >> relation > > What? Why on earth would that be a good idea? Not to speak of hte fact > that that's not been the case since ~8.4? Sorry, I just meant the default behavior shouldn't change but I do agree that we need the ability to keep the same behavior. >> ,however if we are going to keep the existing behavior of this >> patch, VACUUM (FROZEN) seems to be better than (FORCE)? > > There already is FREEZE - meaning something different - so I doubt it. Yeah I thought about that, it is the word "FORCE" that bothers me. When you use FORCE there is an assumption that no matter what, it plows through (think rm -f). So if we don't use FROZEN, that's cool but FORCE doesn't work either. Sincerely, JD -- Command Prompt, Inc. http://the.postgres.company/ +1-503-667-4564 PostgreSQL Centered full stack support, consulting and development. Everyone appreciates your honesty, until you are honest with them.
* Joshua D. Drake (jd@commandprompt.com) wrote: > Yeah I thought about that, it is the word "FORCE" that bothers me. > When you use FORCE there is an assumption that no matter what, it > plows through (think rm -f). So if we don't use FROZEN, that's cool > but FORCE doesn't work either. Isn't that exactly what this FORCE option being contemplated would do though? Plow through the entire relation, regardless of what the VM says is all frozen or not? Seems like FORCE is a good word for that to me. Thanks! Stephen
On 2016-05-06 13:54:09 -0700, Joshua D. Drake wrote:
> On 05/06/2016 01:50 PM, Andres Freund wrote:
> > > > Let's add VACUUM (FORCE) or something like that.
> >
> > Yes, that makes sense.
> >
> > > This is actually inverted. Vacuum by default should vacuum the entire
> > > relation
> >
> > What? Why on earth would that be a good idea? Not to speak of the fact
> > that that's not been the case since ~8.4?
>
> Sorry, I just meant the default behavior shouldn't change but I do agree
> that we need the ability to keep the same behavior.

Which default behaviour shouldn't change? The one in master where we skip known frozen pages? Or the released branches where we can't skip those?

> > > ; however, if we are going to keep the existing behavior of this
> > > patch, VACUUM (FROZEN) seems to be better than (FORCE)?
> >
> > There already is FREEZE - meaning something different - so I doubt it.
>
> Yeah I thought about that, it is the word "FORCE" that bothers me. When you
> use FORCE there is an assumption that no matter what, it plows through
> (think rm -f). So if we don't use FROZEN, that's cool but FORCE doesn't work
> either.

SCANALL?
On 05/06/2016 01:58 PM, Stephen Frost wrote:
> * Joshua D. Drake (jd@commandprompt.com) wrote:
>> Yeah I thought about that, it is the word "FORCE" that bothers me.
>> When you use FORCE there is an assumption that no matter what, it
>> plows through (think rm -f). So if we don't use FROZEN, that's cool
>> but FORCE doesn't work either.
>
> Isn't that exactly what this FORCE option being contemplated would do
> though? Plow through the entire relation, regardless of what the VM
> says is all frozen or not?
>
> Seems like FORCE is a good word for that to me.

Except that we aren't FORCING a vacuum. That is the part I have contention with. To me, FORCE means: No matter what else is happening, we are vacuuming this relation (think locks).

But I am also not going to dig in my heels. If that is truly what -hackers come up with, thank you at least for considering what I said.

Sincerely,

JD

> Thanks!
>
> Stephen

--
Command Prompt, Inc. http://the.postgres.company/ +1-503-667-4564
PostgreSQL Centered full stack support, consulting and development.
Everyone appreciates your honesty, until you are honest with them.
On 05/06/2016 01:58 PM, Andres Freund wrote: > On 2016-05-06 13:54:09 -0700, Joshua D. Drake wrote: >> On 05/06/2016 01:50 PM, Andres Freund wrote: >>> There already is FREEZE - meaning something different - so I doubt it. >> >> Yeah I thought about that, it is the word "FORCE" that bothers me. When you >> use FORCE there is an assumption that no matter what, it plows through >> (think rm -f). So if we don't use FROZEN, that's cool but FORCE doesn't work >> either. > > SCANALL? > VACUUM THEWHOLEDAMNTHING -- -- Josh Berkus Red Hat OSAS (any opinions are my own)
On 05/06/2016 02:01 PM, Josh berkus wrote: > On 05/06/2016 01:58 PM, Andres Freund wrote: >> On 2016-05-06 13:54:09 -0700, Joshua D. Drake wrote: >>> On 05/06/2016 01:50 PM, Andres Freund wrote: > >>>> There already is FREEZE - meaning something different - so I doubt it. >>> >>> Yeah I thought about that, it is the word "FORCE" that bothers me. When you >>> use FORCE there is an assumption that no matter what, it plows through >>> (think rm -f). So if we don't use FROZEN, that's cool but FORCE doesn't work >>> either. >> >> SCANALL? >> > > VACUUM THEWHOLEDAMNTHING > I know that would never fly but damn if that wouldn't be an awesome keyword for VACUUM. JD -- Command Prompt, Inc. http://the.postgres.company/ +1-503-667-4564 PostgreSQL Centered full stack support, consulting and development. Everyone appreciates your honesty, until you are honest with them.
* Josh berkus (josh@agliodbs.com) wrote: > On 05/06/2016 01:58 PM, Andres Freund wrote: > > On 2016-05-06 13:54:09 -0700, Joshua D. Drake wrote: > >> On 05/06/2016 01:50 PM, Andres Freund wrote: > > >>> There already is FREEZE - meaning something different - so I doubt it. > >> > >> Yeah I thought about that, it is the word "FORCE" that bothers me. When you > >> use FORCE there is an assumption that no matter what, it plows through > >> (think rm -f). So if we don't use FROZEN, that's cool but FORCE doesn't work > >> either. > > > > SCANALL? > > > > VACUUM THEWHOLEDAMNTHING +100 (hahahaha) Thanks! Stephen
On 2016-05-06 14:03:11 -0700, Joshua D. Drake wrote:
> On 05/06/2016 02:01 PM, Josh berkus wrote:
> > On 05/06/2016 01:58 PM, Andres Freund wrote:
> > > On 2016-05-06 13:54:09 -0700, Joshua D. Drake wrote:
> > > > On 05/06/2016 01:50 PM, Andres Freund wrote:
> > > > > There already is FREEZE - meaning something different - so I doubt it.
> > > >
> > > > Yeah I thought about that, it is the word "FORCE" that bothers me. When you
> > > > use FORCE there is an assumption that no matter what, it plows through
> > > > (think rm -f). So if we don't use FROZEN, that's cool but FORCE doesn't work
> > > > either.
> > >
> > > SCANALL?
> >
> > VACUUM THEWHOLEDAMNTHING
>
> I know that would never fly but damn if that wouldn't be an awesome keyword
> for VACUUM.

It bothers me more than it probably should: Nobody tests, reviews, whatever a complex patch with significant data-loss potential. But as soon as somebody dares to mention an option name...
On 05/06/2016 02:03 PM, Stephen Frost wrote:
>>
>> VACUUM THEWHOLEDAMNTHING
>
> +100
>
> (hahahaha)

You know what? Why not? Seriously? We aren't a product. This is supposed to be a bit fun. Let's have some fun with it? It would be so easy to turn that into a positive advocacy opportunity.

JD

--
Command Prompt, Inc. http://the.postgres.company/ +1-503-667-4564
PostgreSQL Centered full stack support, consulting and development.
Everyone appreciates your honesty, until you are honest with them.
On 05/06/2016 02:08 PM, Andres Freund wrote:
> It bothers me more than it probably should: Nobody tests, reviews,
> whatever a complex patch with significant data-loss potential. But as
> soon as somebody dares to mention an option name...

Definitely more than it should, because it's gonna happen *every* time.

https://en.wikipedia.org/wiki/Law_of_triviality

--
Josh Berkus
Red Hat OSAS
(any opinions are my own)
On 2016-05-06 14:10:04 -0700, Josh berkus wrote:
> On 05/06/2016 02:08 PM, Andres Freund wrote:
>
> > It bothers me more than it probably should: Nobody tests, reviews,
> > whatever a complex patch with significant data-loss potential. But as
> > soon as somebody dares to mention an option name...
>
> Definitely more than it should, because it's gonna happen *every* time.
>
> https://en.wikipedia.org/wiki/Law_of_triviality

Doesn't mean it should not be frowned upon.
On 05/06/2016 02:12 PM, Andres Freund wrote:
> On 2016-05-06 14:10:04 -0700, Josh berkus wrote:
>> On 05/06/2016 02:08 PM, Andres Freund wrote:
>>
>>> It bothers me more than it probably should: Nobody tests, reviews,
>>> whatever a complex patch with significant data-loss potential. But as
>>> soon as somebody dares to mention an option name...
>>
>> Definitely more than it should, because it's gonna happen *every* time.
>>
>> https://en.wikipedia.org/wiki/Law_of_triviality
>
> Doesn't mean it should not be frowned upon.

Or made light of, hence my post. Personally I don't care what the option is called, as long as we have docs for it.

For the serious testing, does anyone have a good technique for creating loads which would stress-test vacuum freezing? It's hard for me to come up with anything which wouldn't be very time-and-resource intensive (like running at 10,000 TPS for a week).

--
Josh Berkus
Red Hat OSAS
(any opinions are my own)
On 05/06/2016 02:08 PM, Andres Freund wrote:
>>> VACUUM THEWHOLEDAMNTHING
>>
>> I know that would never fly but damn if that wouldn't be an awesome keyword
>> for VACUUM.
>
> It bothers me more than it probably should: Nobody tests, reviews,
> whatever a complex patch with significant data-loss potential. But as
> soon as somebody dares to mention an option name...

That is a fair complaint but let me ask you something: How do I test? Is there a script I can run? Are there specific things I can do to try and break it? What are we looking for exactly?

A lot of -hackers seem to forget that although we have 100 -hackers, we have 10000 "consultant/practitioners". Could I read the code and, with a weekend of WTF and -hackers questions, figure out what is going on? Yes, but a lot of people couldn't and I don't have the time.

You want me (or people like me) to test more? Give us an easy way to do it. Otherwise, we do what we can, which is try and interface on the things that will directly and immediately affect us (like keywords and syntax).

Sincerely,

JD

--
Command Prompt, Inc. http://the.postgres.company/ +1-503-667-4564
PostgreSQL Centered full stack support, consulting and development.
Everyone appreciates your honesty, until you are honest with them.
On 2016-05-06 14:15:47 -0700, Josh berkus wrote:
> For the serious testing, does anyone have a good technique for creating
> loads which would stress-test vacuum freezing? It's hard for me to come
> up with anything which wouldn't be very time-and-resource intensive
> (like running at 10,000 TPS for a week).

I've changed the limits for freezing options a while back, so you can now set autovacuum_freeze_max_age as low as 100000 (best set vacuum_freeze_table_age accordingly). You'll have to come up with a workload that doesn't overwrite all data continuously (otherwise there'll never be old rows), but otherwise it should now be fairly easy to test that kind of scenario.

Andres
Hi,

On 2016-05-06 14:17:13 -0700, Joshua D. Drake wrote:
> How do I test?
>
> Is there a script I can run?

Unfortunately there are few interesting things to test with pre-made scripts. There's no relevant OS dependency here, so each already existing test doesn't really lead to significantly increased coverage being run by other people. Generally, when testing for correctness issues, it's often of limited benefit to run tests written by the author or reviewer - such scripts will usually just test things either one has thought of. The dangerous areas are the ones neither author nor reviewer has considered.

> Are there specific things I can do to try and break it?

Upgrade clusters using pg_upgrade and make sure things like index only scans still work and yield correct data. Set up workloads that involve freezing, and check that less WAL (and not more!) is generated with 9.6 than with 9.5. Make sure queries still work.

> What are we looking for exactly?

Data corruption, efficiency problems.

> A lot of -hackers seem to forget that although we have 100 -hackers, we have
> 10000 "consultant/practitioners". Could I read the code and, with a weekend
> of WTF and -hackers questions, figure out what is going on? Yes, but a lot of
> people couldn't and I don't have the time.

I think tests without reading the code are quite sensible and important. And it perfectly makes sense to ask for information about what to test. But fundamentally testing is a lot of work, as is writing and reviewing code; unless you're really really good at destructive testing, you won't find much in a 15 minute break.

> You want me (or people like me) to test more? Give us an easy way to
> do it.

Useful additional testing and easy just don't go well together. By the time I have made it easy I've done the testing that's needed.

> Otherwise, we do what we can, which is try and interface on the things that
> will directly and immediately affect us (like keywords and syntax).

The amount of bikeshedding on -hackers steals energy and time for actually working on stuff, including testing. So I have little sympathy for the amount of bike shedding done.

Greetings,

Andres Freund
Joshua D. Drake wrote:
> On 05/06/2016 01:40 PM, Robert Haas wrote:
> > On Wed, May 4, 2016 at 8:08 PM, Andres Freund <andres@anarazel.de> wrote:
> > > On 2016-05-02 14:48:18 -0700, Andres Freund wrote:
> > > > 77a1d1e Department of second thoughts: remove PD_ALL_FROZEN.
> > >
> > > Nothing to say here.
> > >
> > > > fd31cd2 Don't vacuum all-frozen pages.
> > >
> > > Hm. I do wonder if it's going to bite us that we don't have a way to
> > > actually force vacuuming of the whole table (besides manually rm'ing the
> > > VM). I've more than once seen VACUUM used to try to do some integrity
> > > checking of the database. How are we actually going to test that the
> > > feature works correctly? They'd have to write checks on top of
> > > pg_visibility to see whether things are borked.
> >
> > Let's add VACUUM (FORCE) or something like that.
>
> This is actually inverted. Vacuum by default should vacuum the entire
> relation; however, if we are going to keep the existing behavior of this
> patch, VACUUM (FROZEN) seems to be better than (FORCE)?

Prior to some 7.x release, VACUUM actually did what we ripped out in the 9.0 release as VACUUM FULL. We actually changed the mode of operation quite heavily into the "lazy" mode which didn't acquire access exclusive lock, and it was a huge relief. I think that changing the mode of operation to be the lightest possible thing that gets the job done is convenient for users, because their existing scripts continue to clean their tables, only they take less time. No need to tweak the maintenance scripts.

I don't know what happens when the freeze_table_age threshold is reached. Do we scan the whole table when that happens? Because if we do, then we don't need a new keyword: just invoke the command after lowering the setting.

Another question on this feature is what happens with the table age (relfrozenxid, relminmxid) when the table is not wholly scanned by vacuum.

--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Andres Freund wrote: > On 2016-05-06 14:17:13 -0700, Joshua D. Drake wrote: > > How do I test? > > > > Is there a script I can run? > > Unfortunately there's few interesting things to test with pre-made > scripts. There's no relevant OS dependency here, so each already > existing test doesn't really lead to significantly increased coverage > being run by other people. Generally, when testing for correctness > issues, it's often of limited benefit to run tests written by the author > of reviewer - such scripts will usually just test things either has > thought of. The dangerous areas are the ones neither author or reviewer > has considered. We touched this question in connection with multixact freezing and wraparound. Testers seem to want to be given a script that they can install and run, then go for a beer and get back to a bunch of errors to report. But it doesn't work that way; writing a useful test script requires a lot of effort. Jeff Janes has done astounding work in these matters. (I don't think we credit him enough for that.) -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 05/06/2016 02:29 PM, Andres Freund wrote:
> Hi,
>
> On 2016-05-06 14:17:13 -0700, Joshua D. Drake wrote:
>> How do I test?
>>
>> Is there a script I can run?
>
> Unfortunately there are few interesting things to test with pre-made
> scripts. There's no relevant OS dependency here, so each already
> existing test doesn't really lead to significantly increased coverage
> being run by other people. Generally, when testing for correctness
> issues, it's often of limited benefit to run tests written by the author
> or reviewer - such scripts will usually just test things either one has
> thought of. The dangerous areas are the ones neither author nor reviewer
> has considered.

I can't argue with that.

>> Are there specific things I can do to try and break it?
>
> Upgrade clusters using pg_upgrade and make sure things like index only
> scans still work and yield correct data. Set up workloads that involve
> freezing, and check that less WAL (and not more!) is generated with 9.6
> than with 9.5. Make sure queries still work.
>
>> What are we looking for exactly?
>
> Data corruption, efficiency problems.

I am really not trying to be difficult here but Data Corruption is an easy one... what is the metric we accept as an efficiency problem?

>> A lot of -hackers seem to forget that although we have 100 -hackers, we have
>> 10000 "consultant/practitioners". Could I read the code and, with a weekend
>> of WTF and -hackers questions, figure out what is going on? Yes, but a lot of
>> people couldn't and I don't have the time.
>
> I think tests without reading the code are quite sensible and
> important. And it perfectly makes sense to ask for information about
> what to test. But fundamentally testing is a lot of work, as is writing
> and reviewing code; unless you're really really good at destructive
> testing, you won't find much in a 15 minute break.

Yes, this is true but with a proper testing framework, I don't need a 15 minute break. I need 1 hour to configure, the rest just "happens" and reports back. I have cycles to test, I have team members to help test (as do *lots* of other people) but sometimes we just get lost in how to help.

>> You want me (or people like me) to test more? Give us an easy way to
>> do it.
>
> Useful additional testing and easy just don't go well together. By the
> time I have made it easy I've done the testing that's needed.

I don't know that I can agree with this. A proper harness allows you to execute: go.sh and boom... 2, 4, even 8 hours later you get a report. I will not argue that it isn't easy to implement but I know it can be done.

>> Otherwise, we do what we can, which is try and interface on the things that
>> will directly and immediately affect us (like keywords and syntax).
>
> The amount of bikeshedding on -hackers steals energy and time for
> actually working on stuff, including testing. So I have little sympathy
> for the amount of bike shedding done.

Ensuring a reasonable and thought-out interface for users is not bike shedding; it is at least as important and possibly more important than any feature we add.

Sincerely,

JD

--
Command Prompt, Inc. http://the.postgres.company/ +1-503-667-4564
PostgreSQL Centered full stack support, consulting and development.
Everyone appreciates your honesty, until you are honest with them.
On 2016-05-06 14:39:57 -0700, Joshua D. Drake wrote:
> > > What are we looking for exactly?
> >
> > Data corruption, efficiency problems.
>
> I am really not trying to be difficult here but Data Corruption is an easy
> one... what is the metric we accept as an efficiency problem?

That's indeed not easy to define. In this case I'd say vacuums taking longer, index only scans being slower, more WAL being generated would count?

> > I think tests without reading the code are quite sensible and
> > important. And it perfectly makes sense to ask for information about
> > what to test. But fundamentally testing is a lot of work, as is writing
> > and reviewing code; unless you're really really good at destructive
> > testing, you won't find much in a 15 minute break.
>
> Yes, this is true but with a proper testing framework, I don't need a 15
> minute break. I need 1 hour to configure, the rest just "happens" and
> reports back.

That only works if somebody writes such tests. And in that case the tester having run them will often suffice (until related changes are being made). I'm not arguing against introducing more tests into the codebase - I'm rather fervently for that. But that really isn't what's going to avoid issues like this feature (or multixact) causing problems, because those tests will just test what the author thought of.

> > > You want me (or people like me) to test more? Give us an easy way to
> > > do it.
> >
> > Useful additional testing and easy just don't go well together. By the
> > time I have made it easy I've done the testing that's needed.
>
> I don't know that I can agree with this. A proper harness allows you to
> execute: go.sh and boom... 2, 4, even 8 hours later you get a report. I will
> not argue that it isn't easy to implement but I know it can be done.

The problem is that the contents of go.sh are the much more relevant part than the 8 hours.

Greetings,

Andres Freund
On 2016-05-06 18:36:52 -0300, Alvaro Herrera wrote:
> Andres Freund wrote:
> > On 2016-05-06 14:17:13 -0700, Joshua D. Drake wrote:
> > > How do I test?
> > >
> > > Is there a script I can run?
> >
> > Unfortunately there are few interesting things to test with pre-made
> > scripts. There's no relevant OS dependency here, so each already
> > existing test doesn't really lead to significantly increased coverage
> > being run by other people. Generally, when testing for correctness
> > issues, it's often of limited benefit to run tests written by the author
> > or reviewer - such scripts will usually just test things either one has
> > thought of. The dangerous areas are the ones neither author nor reviewer
> > has considered.
>
> We touched this question in connection with multixact freezing and
> wraparound. Testers seem to want to be given a script that they can
> install and run, then go for a beer and get back to a bunch of errors to
> report. But it doesn't work that way; writing a useful test script
> requires a lot of effort.

Right. And once written, often enough running it on a lot more instances only marginally increases the coverage.

> Jeff Janes has done astounding work in these matters. (I don't think
> we credit him enough for that.)

+many.
On 05/06/2016 02:48 PM, Andres Freund wrote:
> On 2016-05-06 14:39:57 -0700, Joshua D. Drake wrote:
>> Yes, this is true but with a proper testing framework, I don't need a 15
>> minute break. I need 1 hour to configure, the rest just "happens" and
>> reports back.
>
> That only works if somebody writes such tests.

Agreed.

> And in that case the
> tester having run them will often suffice (until related changes are being
> made). I'm not arguing against introducing more tests into the codebase
> - I'm rather fervently for that. But that really isn't what's going to
> avoid issues like this feature (or multixact) causing problems, because
> those tests will just test what the author thought of.

Good point. I am not sure how to address the alternative though.

>>>> You want me (or people like me) to test more? Give us an easy way to
>>>> do it.
>>>
>>> Useful additional testing and easy just don't go well together. By the
>>> time I have made it easy I've done the testing that's needed.
>>
>> I don't know that I can agree with this. A proper harness allows you to
>> execute: go.sh and boom... 2, 4, even 8 hours later you get a report. I will
>> not argue that it isn't easy to implement but I know it can be done.
>
> The problem is that the contents of go.sh are the much more relevant
> part than the 8 hours.

True. Please don't misunderstand, I am not saying this is "easy". I just hope that it is something we work toward.

Sincerely,

JD

--
Command Prompt, Inc. http://the.postgres.company/ +1-503-667-4564
PostgreSQL Centered full stack support, consulting and development.
Everyone appreciates your honesty, until you are honest with them.
On 2016-05-06 18:31:03 -0300, Alvaro Herrera wrote:
> I don't know what happens when the freeze_table_age threshold is
> reached.

We scan all non-frozen pages, whereas we earlier had to scan all pages. That's really both the significant benefit, and the danger. Because if we screw up the all-frozen bits in the visibility map, we'll be screwed soon after.

> Do we scan the whole table when that happens?

No, there's atm no way to force a whole-table vacuum, besides manually rm'ing the _vm fork.

> Another question on this feature is what happens with the table age
> (relfrozenxid, relminmxid) when the table is not wholly scanned by
> vacuum.

Basically we increase the horizons whenever scanning all pages that are not known to be frozen (+ potentially some frozen ones due to the skipping logic). Without that there'd really not be a point in the freeze map feature, as we'd continue to have the expensive anti-wraparound vacuums.

Andres
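[A compact paraphrase - ours, not the committed code - of that horizon-advancement decision: blocks skipped because the VM says they are all-frozen still count toward full coverage, so relfrozenxid/relminmxid can be advanced without re-reading them, while blocks skipped for any other reason (such as a held pin) block the advance:

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

typedef uint32_t BlockNumber;

/* Frozen-skipped pages count as covered; pages skipped for other
 * reasons do not, which is why miscounting frozenskipped_pages would
 * wrongly prevent (or, worse, wrongly allow) advancing the horizons. */
static bool
can_advance_relfrozenxid(BlockNumber scanned_pages,
                         BlockNumber frozenskipped_pages,
                         BlockNumber rel_pages)
{
    return (scanned_pages + frozenskipped_pages) >= rel_pages;
}

int
main(void)
{
    /* 30 pages read + 70 all-frozen skips cover a 100-page table */
    printf("%d\n", can_advance_relfrozenxid(30, 70, 100));
    return 0;
}
]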
On Fri, May 6, 2016 at 2:49 PM, Andres Freund <andres@anarazel.de> wrote: >> Jeff Janes has done astounding work in these matters. (I don't think >> we credit him enough for that.) > > +many. Agreed. I'm a huge fan of what Jeff has been able to do in this area. I often say so. It would be even better if Jeff's approach to testing was followed as an example by other people, but I wouldn't bet on it ever happening. It requires real persistence and deep understanding to do well. -- Peter Geoghegan
On Sat, May 7, 2016 at 8:34 AM, Robert Haas <robertmhaas@gmail.com> wrote: > On Mon, May 2, 2016 at 8:25 PM, Andres Freund <andres@anarazel.de> wrote: >> +static const uint8 number_of_ones_for_visible[256] = { >> ... >> +}; >> +static const uint8 number_of_ones_for_frozen[256] = { >> ... >> }; >> >> Did somebody verify the new contents are correct? > > I admit that I didn't. It seemed like an unlikely place for a goof, > but I guess we should verify. Looks correct. The tables match the output of the attached script. -- Thomas Munro http://www.enterprisedb.com
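[Since the attached script itself is not reproduced in the archive, here is a quick stand-alone cross-check of our own in C: for every possible VM byte, count how many of its four two-bit slots have the all-visible (low bit of the slot) or all-frozen (high bit) flag set, and compare the printed tables against the committed arrays:

#include <stdio.h>

int
main(void)
{
    for (int frozen = 0; frozen <= 1; frozen++)
    {
        printf("number_of_ones_for_%s:\n", frozen ? "frozen" : "visible");
        for (int byte = 0; byte < 256; byte++)
        {
            int ones = 0;

            /* four heap blocks per VM byte, two bits per block */
            for (int slot = 0; slot < 4; slot++)
                if (byte & (1 << (slot * 2 + frozen)))
                    ones++;
            printf("%d%s", ones, (byte & 15) == 15 ? "\n" : ", ");
        }
    }
    return 0;
}
]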
On 2016-05-07 10:00:27 +1200, Thomas Munro wrote: > On Sat, May 7, 2016 at 8:34 AM, Robert Haas <robertmhaas@gmail.com> wrote: > >> Did somebody verify the new contents are correct? > > > > I admit that I didn't. It seemed like an unlikely place for a goof, > > but I guess we should verify. > > Looks correct. The tables match the output of the attached script. Great!
Alvaro Herrera wrote: > We touched this question in connection with multixact freezing and > wraparound. Testers seem to want to be given a script that they can > install and run, then go for a beer and get back to a bunch of errors to > report. Here I spent some time trying to explain what to test to try and find certain multixact bugs http://www.postgresql.org/message-id/20150605213832.GZ133018@postgresql.org -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Sat, May 7, 2016 at 6:00 AM, Joshua D. Drake <jd@commandprompt.com> wrote:
> On 05/06/2016 01:58 PM, Stephen Frost wrote:
>>
>> * Joshua D. Drake (jd@commandprompt.com) wrote:
>>>
>>> Yeah I thought about that, it is the word "FORCE" that bothers me.
>>> When you use FORCE there is an assumption that no matter what, it
>>> plows through (think rm -f). So if we don't use FROZEN, that's cool
>>> but FORCE doesn't work either.
>>
>> Isn't that exactly what this FORCE option being contemplated would do
>> though? Plow through the entire relation, regardless of what the VM
>> says is all frozen or not?
>>
>> Seems like FORCE is a good word for that to me.
>
> Except that we aren't FORCING a vacuum. That is the part I have contention
> with. To me, FORCE means:
>
> No matter what else is happening, we are vacuuming this relation (think
> locks).
>
> But I am also not going to dig in my heels. If that is truly what -hackers
> come up with, thank you at least for considering what I said.
>
> Sincerely,
>
> JD
>

As Joshua mentioned, the FORCE word might imply doing VACUUM while plowing through locks. I guess that it might confuse the users. IMO, since this option will be a way for emergencies, the SCANALL word works for me.

Or other ideas are:
VACUUM IGNOREVM
VACUUM RESCURE

Regards,

--
Masahiko Sawada
On Sat, May 7, 2016 at 11:08 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> On Sat, May 7, 2016 at 6:00 AM, Joshua D. Drake <jd@commandprompt.com> wrote:
>> On 05/06/2016 01:58 PM, Stephen Frost wrote:
>>>
>>> * Joshua D. Drake (jd@commandprompt.com) wrote:
>>>>
>>>> Yeah I thought about that, it is the word "FORCE" that bothers me.
>>>> When you use FORCE there is an assumption that no matter what, it
>>>> plows through (think rm -f). So if we don't use FROZEN, that's cool
>>>> but FORCE doesn't work either.
>>>
>>> Isn't that exactly what this FORCE option being contemplated would do
>>> though? Plow through the entire relation, regardless of what the VM
>>> says is all frozen or not?
>>>
>>> Seems like FORCE is a good word for that to me.
>>
>> Except that we aren't FORCING a vacuum. That is the part I have contention
>> with. To me, FORCE means:
>>
>> No matter what else is happening, we are vacuuming this relation (think
>> locks).
>>
>> But I am also not going to dig in my heels. If that is truly what -hackers
>> come up with, thank you at least for considering what I said.
>>
>> Sincerely,
>>
>> JD
>>
>
> As Joshua mentioned, the FORCE word might imply doing VACUUM while plowing
> through locks. I guess that it might confuse the users.
> IMO, since this option will be a way for emergencies, the SCANALL word works
> for me.
>
> Or other ideas are:
> VACUUM IGNOREVM
> VACUUM RESCURE
>

Oops, VACUUM RESCUE is correct.

Regards,

--
Masahiko Sawada
On Sun, May 8, 2016 at 3:18 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote: > On Sat, May 7, 2016 at 11:08 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote: >> On Sat, May 7, 2016 at 6:00 AM, Joshua D. Drake <jd@commandprompt.com> wrote: >>> On 05/06/2016 01:58 PM, Stephen Frost wrote: >>>> >>>> * Joshua D. Drake (jd@commandprompt.com) wrote: >>>>> >>>>> Yeah I thought about that, it is the word "FORCE" that bothers me. >>>>> When you use FORCE there is an assumption that no matter what, it >>>>> plows through (think rm -f). So if we don't use FROZEN, that's cool >>>>> but FORCE doesn't work either. >>>> >>>> >>>> Isn't that exactly what this FORCE option being contemplated would do >>>> though? Plow through the entire relation, regardless of what the VM >>>> says is all frozen or not? >>>> >>>> Seems like FORCE is a good word for that to me. >>> >>> >>> Except that we aren't FORCING a vacuum. That is the part I have contention >>> with. To me, FORCE means: >>> >>> No matter what else is happening, we are vacuuming this relation (think >>> locks). >>> >>> But I am also not going to dig in my heals. If that is truly what -hackers >>> come up with, thank you at least considering what I said. >>> >>> Sincerely, >>> >>> JD >>> >> >> As Joshua mentioned, FORCE word might imply doing VACUUM while plowing >> through locks. >> I guess that it might confuse the users. >> IMO, since this option will be a way for emergency, SCANALL word works for me. >> >> Or other ideas are, >> VACUUM IGNOREVM >> VACUUM RESCURE >> > > Oops, VACUUM RESCUE is correct. > Attached draft patch adds SCANALL option to VACUUM in order to scan all pages forcibly while ignoring visibility map information. The option name is SCANALL for now but we could change it after got consensus. Regards, -- Masahiko Sawada
On Tue, May 3, 2016 at 6:48 AM, Andres Freund <andres@anarazel.de> wrote: > fd31cd2 Don't vacuum all-frozen pages. - appendStringInfo(&buf, _("pages: %u removed, %u remain, %u skipped due to pins\n"), + appendStringInfo(&buf, _("pages: %u removed, %u remain, %u skipped due to pins, %u skipped frozen\n"), vacrelstats->pages_removed, vacrelstats->rel_pages, - vacrelstats->pinskipped_pages); + vacrelstats->pinskipped_pages, + vacrelstats->frozenskipped_pages); The verbose information about skipped frozen pages is emitted only by autovacuum, but I think that this information is also helpful for manual vacuum. Please find attached a patch which fixes that. Regards, -- Masahiko Sawada
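With that fix, a manual VACUUM VERBOSE reports the same counter that autovacuum already logs; the pages line would then read something like the following (the numbers are invented for illustration):

    pages: 0 removed, 8850 remain, 2 skipped due to pins, 8640 skipped frozen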
On Sun, May 8, 2016 at 10:42 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote: > Attached draft patch adds SCANALL option to VACUUM in order to scan > all pages forcibly while ignoring visibility map information. > The option name is SCANALL for now but we could change it after got consensus. If we're going to go that way, I'd say it should be scan_all rather than scanall. Makes it clearer, at least IMHO. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Mon, May 9, 2016 at 10:53 PM, Robert Haas <robertmhaas@gmail.com> wrote: > On Sun, May 8, 2016 at 10:42 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote: >> Attached draft patch adds SCANALL option to VACUUM in order to scan >> all pages forcibly while ignoring visibility map information. >> The option name is SCANALL for now but we could change it after got consensus. > > If we're going to go that way, I'd say it should be scan_all rather > than scanall. Makes it clearer, at least IMHO. Just to add some diversity to opinions, maybe there should be a separate command for performing integrity checks. Currently the best ways to actually verify database correctness do so as a side effect. The question that I get pretty much every time after I explain why we have data checksums, is "how do I check that they are correct" and we don't have a nice answer for that now. We could also use some ways to sniff out corrupted rows that don't involve crashing the server in a loop. Vacuuming pages that supposedly don't need vacuuming just to verify integrity seems very much in the same vein. I know right now isn't exactly the best time to hastily slap on such a feature, but I just wanted the thought to be out there for consideration. Regards, Ants Aasma
On Mon, May 9, 2016 at 7:40 PM, Ants Aasma <ants.aasma@eesti.ee> wrote: > On Mon, May 9, 2016 at 10:53 PM, Robert Haas <robertmhaas@gmail.com> wrote: >> On Sun, May 8, 2016 at 10:42 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote: >>> Attached draft patch adds SCANALL option to VACUUM in order to scan >>> all pages forcibly while ignoring visibility map information. >>> The option name is SCANALL for now but we could change it after got consensus. >> >> If we're going to go that way, I'd say it should be scan_all rather >> than scanall. Makes it clearer, at least IMHO. > > Just to add some diversity to opinions, maybe there should be a > separate command for performing integrity checks. Currently the best > ways to actually verify database correctness do so as a side effect. > The question that I get pretty much every time after I explain why we > have data checksums, is "how do I check that they are correct" and we > don't have a nice answer for that now. We could also use some ways to > sniff out corrupted rows that don't involve crashing the server in a > loop. Vacuuming pages that supposedly don't need vacuuming just to > verify integrity seems very much in the same vein. > > I know right now isn't exactly the best time to hastily slap on such a > feature, but I just wanted the thought to be out there for > consideration. I think that it's quite reasonable to have ways of performing an integrity check that are separate from VACUUM, but this is about having a way to force VACUUM to scan all-frozen pages - and it's hard to imagine that we want a different command name for that. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Tue, May 10, 2016 at 11:30 PM, Robert Haas <robertmhaas@gmail.com> wrote: > On Mon, May 9, 2016 at 7:40 PM, Ants Aasma <ants.aasma@eesti.ee> wrote: >> On Mon, May 9, 2016 at 10:53 PM, Robert Haas <robertmhaas@gmail.com> wrote: >>> On Sun, May 8, 2016 at 10:42 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote: >>>> Attached draft patch adds SCANALL option to VACUUM in order to scan >>>> all pages forcibly while ignoring visibility map information. >>>> The option name is SCANALL for now but we could change it after got consensus. >>> >>> If we're going to go that way, I'd say it should be scan_all rather >>> than scanall. Makes it clearer, at least IMHO. >> >> Just to add some diversity to opinions, maybe there should be a >> separate command for performing integrity checks. Currently the best >> ways to actually verify database correctness do so as a side effect. >> The question that I get pretty much every time after I explain why we >> have data checksums, is "how do I check that they are correct" and we >> don't have a nice answer for that now. We could also use some ways to >> sniff out corrupted rows that don't involve crashing the server in a >> loop. Vacuuming pages that supposedly don't need vacuuming just to >> verify integrity seems very much in the same vein. >> >> I know right now isn't exactly the best time to hastily slap on such a >> feature, but I just wanted the thought to be out there for >> consideration. > > I think that it's quite reasonable to have ways of performing an > integrity check that are separate from VACUUM, but this is about > having a way to force VACUUM to scan all-frozen pages Or second way I came up with is having tool to remove particular _vm file safely, which is executed via SQL or client tool like pg_resetxlog. Attached updated VACUUM SCAN_ALL patch. Please find it. Regards, -- Masahiko Sawada
On 5/6/16 4:20 PM, Andres Freund wrote: > On 2016-05-06 14:15:47 -0700, Josh berkus wrote: >> For the serious testing, does anyone have a good technique for creating >> loads which would stress-test vacuum freezing? It's hard for me to come >> up with anything which wouldn't be very time-and-resource intensive >> (like running at 10,000 TPS for a week). > > I've changed the limits for freezing options a while back, so you can > now set autovacuum_freeze_max as low as 100000 (best set > vacuum_freeze_table_age accordingly). You'll have to come up with a > workload that doesn't overwrite all data continuously (otherwise > there'll never be old rows), but otherwise it should now be fairly easy > to test that kind of scenario. There's also been a tool for forcibly advancing XID floating around for quite some time. Using that could have the added benefit of verifying anti-wrap still works correctly. (Might be worth testing mxid wrap too...) -- Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX Experts in Analytics, Data Architecture and PostgreSQL Data in Trouble? Get it in Treble! http://BlueTreble.com 855-TREBLE2 (855-873-2532) mobile: 512-569-9461
On 5/6/16 4:55 PM, Peter Geoghegan wrote: > On Fri, May 6, 2016 at 2:49 PM, Andres Freund <andres@anarazel.de> wrote: >>> Jeff Janes has done astounding work in these matters. (I don't think >>> we credit him enough for that.) >> >> +many. > > Agreed. I'm a huge fan of what Jeff has been able to do in this area. > I often say so. It would be even better if Jeff's approach to testing > was followed as an example by other people, but I wouldn't bet on it > ever happening. It requires real persistence and deep understanding to > do well. It takes deep understanding to *design* the tests, not to write them. There's a lot of folks out there that will never understand enough to design tests meant to expose data corruption but who could easily code someone else's design, especially if we provided tools/ways to tweak a cluster to make testing easier/faster (such as artificially advancing XID/MXID). -- Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX Experts in Analytics, Data Architecture and PostgreSQL Data in Trouble? Get it in Treble! http://BlueTreble.com 855-TREBLE2 (855-873-2532) mobile: 512-569-9461
On 5/6/16 4:08 PM, Joshua D. Drake wrote: >>> >>> VACUUM THEWHOLEDAMNTHING >> >> +100 >> >> (hahahaha) > > You know what? Why not? Seriously? We aren't product. This is supposed > to be a bit fun. Let's have some fun with it? It would be so easy to > turn that into a positive advocacy opportunity. Honestly, for an option this obscure, I agree. I don't think we'd want any normally used stuff named so glibly, but I sure as heck could have used some easter-eggs like this when I was doing training. -- Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX Experts in Analytics, Data Architecture and PostgreSQL Data in Trouble? Get it in Treble! http://BlueTreble.com 855-TREBLE2 (855-873-2532) mobile: 512-569-9461
On 5/10/16 11:42 PM, Jim Nasby wrote: > On 5/6/16 4:55 PM, Peter Geoghegan wrote: >> On Fri, May 6, 2016 at 2:49 PM, Andres Freund <andres@anarazel.de> wrote: >>>> Jeff Janes has done astounding work in these matters. (I don't think >>>> we credit him enough for that.) >>> >>> +many. >> >> Agreed. I'm a huge fan of what Jeff has been able to do in this area. >> I often say so. It would be even better if Jeff's approach to testing >> was followed as an example by other people, but I wouldn't bet on it >> ever happening. It requires real persistence and deep understanding to >> do well. > > It takes deep understanding to *design* the tests, not to write them. > There's a lot of folks out there that will never understand enough to > design tests meant to expose data corruption but who could easily code > someone else's design, especially if we provided tools/ways to tweak a > cluster to make testing easier/faster (such as artificially advancing > XID/MXID). Speaking of which, another email in the thread made me realize that there's a test condition no one has mentioned: verifying we don't lose tuples after wraparound. To test this, you'd want a table that's mostly frozen. Ideally, dirty a single tuple on a bunch of frozen pages, with committed updates, deletes, and un-committed inserts. Advance XID far enough to get you close to wrap-around. Do a vacuum, SELECT count(*), advance XID past wraparound, SELECT count(*) again and you should get the same number. -- Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX Experts in Analytics, Data Architecture and PostgreSQL Data in Trouble? Get it in Treble! http://BlueTreble.com 855-TREBLE2 (855-873-2532) mobile: 512-569-9461
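For anyone scripting that recipe, a rough libpq driver could look like the sketch below. Note the assumptions: pg_advance_xid is a stand-in for whichever XID-advancing tool ends up being used (no such tool ships with the server), and table t is assumed to have been prepared as described above:

    #include <stdio.h>
    #include <stdlib.h>
    #include <libpq-fe.h>

    /* Run SELECT count(*) and return the result as a long. */
    static long
    count_rows(PGconn *conn)
    {
        PGresult   *res = PQexec(conn, "SELECT count(*) FROM t");
        long        n;

        if (PQresultStatus(res) != PGRES_TUPLES_OK)
        {
            fprintf(stderr, "query failed: %s", PQerrorMessage(conn));
            exit(1);
        }
        n = atol(PQgetvalue(res, 0, 0));
        PQclear(res);
        return n;
    }

    int
    main(void)
    {
        PGconn     *conn = PQconnectdb("");     /* settings from environment */
        long        before,
                    after;

        if (PQstatus(conn) != CONNECTION_OK)
        {
            fprintf(stderr, "connection failed: %s", PQerrorMessage(conn));
            return 1;
        }
        PQclear(PQexec(conn, "VACUUM t"));
        before = count_rows(conn);
        (void) system("pg_advance_xid --past-wraparound");  /* hypothetical */
        after = count_rows(conn);
        if (before == after)
            printf("ok: %ld rows survived wraparound\n", before);
        else
            printf("TUPLES LOST: %ld -> %ld\n", before, after);
        PQfinish(conn);
        return (before == after) ? 0 : 1;
    }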
On Tue, May 10, 2016 at 10:40 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote: > Or second way I came up with is having tool to remove particular _vm > file safely, which is executed via SQL or client tool like > pg_resetxlog. > > Attached updated VACUUM SCAN_ALL patch. > Please find it. We should support scan_all only with the new-style options syntax for VACUUM; that is, vacuum (scan_all) relname. That doesn't require making scan_all a keyword, which is good: this is a minor feature, and we don't want to bloat the parsing tables for it. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
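For context: with the parenthesized syntax the options end up as bits in the VacuumStmt option mask rather than as grammar keywords, which is why the parser footprint stays small. Roughly, following the existing flags in vacuum.h, with the last entry being this thread's hypothetical addition under its working name:

    typedef enum VacuumOption
    {
        VACOPT_VACUUM = 1 << 0,     /* do VACUUM */
        VACOPT_ANALYZE = 1 << 1,    /* do ANALYZE */
        VACOPT_VERBOSE = 1 << 2,    /* print progress info */
        VACOPT_FREEZE = 1 << 3,     /* FREEZE option */
        VACOPT_FULL = 1 << 4,       /* FULL (non-concurrent) vacuum */
        VACOPT_NOWAIT = 1 << 5,     /* don't wait to get lock */
        VACOPT_SKIPTOAST = 1 << 6,  /* don't process the TOAST table */
        VACOPT_SCANALL = 1 << 7     /* hypothetical: don't skip via the VM */
    } VacuumOption;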
On Mon, May 16, 2016 at 10:49 AM, Robert Haas <robertmhaas@gmail.com> wrote: > On Tue, May 10, 2016 at 10:40 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote: >> Or second way I came up with is having tool to remove particular _vm >> file safely, which is executed via SQL or client tool like >> pg_resetxlog. >> >> Attached updated VACUUM SCAN_ALL patch. >> Please find it. > > We should support scan_all only with the new-style options syntax for > VACUUM; that is, vacuum (scan_all) relname. That doesn't require > making scan_all a keyword, which is good: this is a minor feature, and > we don't want to bloat the parsing tables for it. > I agree with having new-style options syntax. Isn't it better to have SCAN_ALL option without parentheses? Syntaxes are; VACUUM SCAN_ALL table_name; VACUUM SCAN_ALL; -- for all tables on database Regards, -- Masahiko Sawada
Masahiko Sawada wrote: > On Mon, May 16, 2016 at 10:49 AM, Robert Haas <robertmhaas@gmail.com> wrote: > > We should support scan_all only with the new-style options syntax for > > VACUUM; that is, vacuum (scan_all) relname. That doesn't require > > making scan_all a keyword, which is good: this is a minor feature, and > > we don't want to bloat the parsing tables for it. > > I agree with having new-style options syntax. > Isn't it better to have SCAN_ALL option without parentheses? > > Syntaxes are; > VACUUM SCAN_ALL table_name; > VACUUM SCAN_ALL; -- for all tables on database No, I agree with Robert that we shouldn't add any more such options to avoid keyword proliferation. Syntaxes are; VACUUM (SCAN_ALL) table_name; VACUUM (SCAN_ALL); -- for all tables on database Is SCAN_ALL really the best we can do here? The business of having an underscore in an option name has no precedent (other than CURRENT_DATABASE and the like). How about COMPLETE, TOTAL, or WHOLE? -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Tue, May 17, 2016 at 3:32 PM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote: > Masahiko Sawada wrote: >> On Mon, May 16, 2016 at 10:49 AM, Robert Haas <robertmhaas@gmail.com> wrote: > >> > We should support scan_all only with the new-style options syntax for >> > VACUUM; that is, vacuum (scan_all) relname. That doesn't require >> > making scan_all a keyword, which is good: this is a minor feature, and >> > we don't want to bloat the parsing tables for it. >> >> I agree with having new-style options syntax. >> Isn't it better to have SCAN_ALL option without parentheses? >> >> Syntaxes are; >> VACUUM SCAN_ALL table_name; >> VACUUM SCAN_ALL; -- for all tables on database > > No, I agree with Robert that we shouldn't add any more such options to > avoid keyword proliferation. > > Syntaxes are; > VACUUM (SCAN_ALL) table_name; > VACUUM (SCAN_ALL); -- for all tables on database Okay, I agree with this. > Is SCAN_ALL really the best we can do here? The business of having an > underscore in an option name has no precedent (other than > CURRENT_DATABASE and the like). Another way is to have a tool or function that removes a particular _vm file safely, for example. > How about COMPLETE, TOTAL, or WHOLE? IMHO, I don't have a strong opinion about SCAN_ALL, as long as we document that option and the name doesn't confuse users. But ISTM that COMPLETE or TOTAL might mislead users into thinking that a normal vacuum is incomplete. Regards, -- Masahiko Sawada
On 05/17/2016 12:32 PM, Alvaro Herrera wrote: > Syntaxes are; > VACUUM (SCAN_ALL) table_name; > VACUUM (SCAN_ALL); -- for all tables on database > > Is SCAN_ALL really the best we can do here? The business of having an > underscore in an option name has no precedent (other than > CURRENT_DATABASE and the like). How about COMPLETE, TOTAL, or WHOLE? > VACUUM (ANALYZE, VERBOSE, WHOLE) .... That seems reasonable? I agree that SCAN_ALL doesn't fit. I am not trying to pull a left turn but is there a technical reason we don't just make FULL do this? JD -- Command Prompt, Inc. http://the.postgres.company/ +1-503-667-4564 PostgreSQL Centered full stack support, consulting and development. Everyone appreciates your honesty, until you are honest with them.
On Tue, May 17, 2016 at 4:34 PM, Joshua D. Drake <jd@commandprompt.com> wrote: > On 05/17/2016 12:32 PM, Alvaro Herrera wrote: > >> Syntaxes are; >> VACUUM (SCAN_ALL) table_name; >> VACUUM (SCAN_ALL); -- for all tables on database >> >> Is SCAN_ALL really the best we can do here? The business of having an >> underscore in an option name has no precedent (other than >> CURRENT_DATABASE and the like). How about COMPLETE, TOTAL, or WHOLE? >> > > VACUUM (ANALYZE, VERBOSE, WHOLE) > .... > > That seems reasonable? I agree that SCAN_ALL doesn't fit. I am not trying to > pull a left turn but is there a technical reason we don't just make FULL do > this? > FULL option requires AccessExclusiveLock, which could be a problem. Regards, -- Masahiko Sawada
On 17/05/16 21:32, Alvaro Herrera wrote: > Is SCAN_ALL really the best we can do here? The business of having an > underscore in an option name has no precedent (other than > CURRENT_DATABASE and the like). ALTER DATABASE has options for ALLOW_CONNECTIONS, CONNECTION_LIMIT, and IS_TEMPLATE. > How about COMPLETE, TOTAL, or WHOLE? Sure, I'll play this game. I like EXHAUSTIVE. -- Vik Fearing +33 6 46 75 15 36 http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support
On 18/05/16 09:34, Vik Fearing wrote: > On 17/05/16 21:32, Alvaro Herrera wrote: >> Is SCAN_ALL really the best we can do here? The business of having an >> underscore in an option name has no precedent (other than >> CURRENT_DATABASE and the like). > ALTER DATABASE has options for ALLOW_CONNECTIONS, CONNECTION_LIMIT, and > IS_TEMPLATE. > >> How about COMPLETE, TOTAL, or WHOLE? > Sure, I'll play this game. I like EXHAUSTIVE. I prefer 'WHOLE', as it seems more obvious (and not because of the pun relating to 'wholesomeness'!!!)
On Tue, May 17, 2016 at 5:47 PM, Gavin Flower <GavinFlower@archidevsys.co.nz> wrote: > On 18/05/16 09:34, Vik Fearing wrote: >> On 17/05/16 21:32, Alvaro Herrera wrote: >>> >>> Is SCAN_ALL really the best we can do here? The business of having an >>> underscore in an option name has no precedent (other than >>> CURRENT_DATABASE and the like). >> >> ALTER DATABASE has options for ALLOW_CONNECTIONS, CONNECTION_LIMIT, and >> IS_TEMPLATE. >> >>> How about COMPLETE, TOTAL, or WHOLE? >> >> Sure, I'll play this game. I like EXHAUSTIVE. > > I prefer 'WHOLE', as it seems more obvious (and not because of the pun > relating to 'wholesomeness'!!!) I think that users might believe that they need VACUUM (WHOLE) a lot more often than they will actually need this option. "Of course I want to vacuum my whole table!" I think we should give this a name that hints more strongly at this being an exceptional thing, like vacuum (even_frozen_pages). -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 5/18/16 6:37 AM, Robert Haas wrote: > On Tue, May 17, 2016 at 5:47 PM, Gavin Flower > <GavinFlower@archidevsys.co.nz> wrote: >> On 18/05/16 09:34, Vik Fearing wrote: >>> On 17/05/16 21:32, Alvaro Herrera wrote: >>>> >>>> Is SCAN_ALL really the best we can do here? The business of having an >>>> underscore in an option name has no precedent (other than >>>> CURRENT_DATABASE and the like). >>> >>> ALTER DATABASE has options for ALLOW_CONNECTIONS, CONNECTION_LIMIT, and >>> IS_TEMPLATE. >>> >>>> How about COMPLETE, TOTAL, or WHOLE? >>> >>> Sure, I'll play this game. I like EXHAUSTIVE. >> >> I prefer 'WHOLE', as it seems more obvious (and not because of the pun >> relating to 'wholesomeness'!!!) > > I think that users might believe that they need VACUUM (WHOLE) a lot > more often than they will actually need this option. "Of course I > want to vacuum my whole table!" > > I think we should give this a name that hints more strongly at this > being an exceptional thing, like vacuum (even_frozen_pages). How about just FROZEN? Perhaps it's too confusing to have that and FREEZE, but I thought I would throw it out there. -- -David david@pgmasters.net
On Wed, May 18, 2016 at 8:41 AM, David Steele <david@pgmasters.net> wrote: >> I think we should give this a name that hints more strongly at this >> being an exceptional thing, like vacuum (even_frozen_pages). > > How about just FROZEN? Perhaps it's too confusing to have that and FREEZE, > but I thought I would throw it out there. It's not a bad thought, but I do think it might be a bit confusing. My main priority for this new option is that people aren't tempted to use it very often, and I think a name like "even_frozen_pages" is more likely to accomplish that than just "frozen". -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 05/18/2016 05:51 AM, Robert Haas wrote: > On Wed, May 18, 2016 at 8:41 AM, David Steele <david@pgmasters.net> wrote: >>> I think we should give this a name that hints more strongly at this >>> being an exceptional thing, like vacuum (even_frozen_pages). >> >> How about just FROZEN? Perhaps it's too confusing to have that and FREEZE, >> but I thought I would throw it out there. > > It's not a bad thought, but I do think it might be a bit confusing. > My main priority for this new option is that people aren't tempted to > use it very often, and I think a name like "even_frozen_pages" is more > likely to accomplish that than just "frozen". > freeze_all_pages? JD -- Command Prompt, Inc. http://the.postgres.company/ +1-503-667-4564 PostgreSQL Centered full stack support, consulting and development. Everyone appreciates your honesty, until you are honest with them.
On Wed, May 18, 2016 at 9:42 AM, Joshua D. Drake <jd@commandprompt.com> wrote: >> It's not a bad thought, but I do think it might be a bit confusing. >> My main priority for this new option is that people aren't tempted to >> use it very often, and I think a name like "even_frozen_pages" is more >> likely to accomplish that than just "frozen". > > freeze_all_pages? No, that's what the existing FREEZE option does. This new option is about unnecessarily vacuuming pages that don't need it. The expectation is that vacuuming all-frozen pages will be a no-op. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
2016-05-18 16:45 GMT+03:00 Robert Haas <robertmhaas@gmail.com>: > No, that's what the existing FREEZE option does. This new option is > about unnecessarily vacuuming pages that don't need it. The > expectation is that vacuuming all-frozen pages will be a no-op. VACUUM (INCLUDING ALL) ? -- Victor Y. Yegorov
On 05/18/2016 09:55 AM, Victor Yegorov wrote: > 2016-05-18 16:45 GMT+03:00 Robert Haas <robertmhaas@gmail.com>: > > No, that's what the existing FREEZE option does. This new option is > about unnecessarily vacuuming pages that don't need it. The > expectation is that vacuuming all-frozen pages will be a no-op. > > > VACUUM (INCLUDING ALL) ? VACUUM (FORCE ALL) ? Joe -- Crunchy Data - http://crunchydata.com PostgreSQL Support for Secure Enterprises Consulting, Training, & Open Source Development
On Wed, May 18, 2016 at 7:09 AM, Joe Conway <mail@joeconway.com> wrote: > On 05/18/2016 09:55 AM, Victor Yegorov wrote: >> 2016-05-18 16:45 GMT+03:00 Robert Haas <robertmhaas@gmail.com>: >> >> No, that's what the existing FREEZE option does. This new option is >> about unnecessarily vacuuming pages that don't need it. The >> expectation is that vacuuming all-frozen pages will be a no-op. >> >> >> VACUUM (INCLUDING ALL) ? > > VACUUM (FORCE ALL) ? How about going with something that says more about why we are doing it, rather than trying to describe in one or two words what it is doing? VACUUM (FORENSIC) VACUUM (DEBUG) VACUUM (LINT) Cheers, Jeff
On Wed, May 18, 2016 at 8:52 AM, Jeff Janes <jeff.janes@gmail.com> wrote: > How about going with something that says more about why we are doing > it, rather than trying to describe in one or two words what it is > doing? > > VACUUM (FORENSIC) > > VACUUM (DEBUG) > > VACUUM (LINT) +1 -- Peter Geoghegan
On 05/18/2016 03:51 PM, Peter Geoghegan wrote: > On Wed, May 18, 2016 at 8:52 AM, Jeff Janes <jeff.janes@gmail.com> wrote: >> How about going with something that says more about why we are doing >> it, rather than trying to describe in one or two words what it is >> doing? >> >> VACUUM (FORENSIC) >> >> VACUUM (DEBUG) >> >> VACUUM (LINT) > > +1 Maybe this is the wrong perspective. I mean, is there a reason we even need this option, other than a lack of any other way to do a full table scan to check for corruption, etc.? If we're only doing this for integrity checking, then maybe it's better if it becomes a function, which could be later extended with additional forensic features? -- -- Josh Berkus Red Hat OSAS (any opinions are my own)
Josh berkus <josh@agliodbs.com> writes: > Maybe this is the wrong perspective. I mean, is there a reason we even > need this option, other than a lack of any other way to do a full table > scan to check for corruption, etc.? If we're only doing this for > integrity checking, then maybe it's better if it becomes a function, > which could be later extended with additional forensic features? Yes, I've been wondering that too. VACUUM is not meant as a corruption checker, and should not be made into one, so what is the point of this flag exactly? (AFAIK, "select count(*) from table" would offer a similar amount of sanity checking as a full-table VACUUM scan does, so it's not like we've removed functionality with no near-term replacement.) regards, tom lane
On 2016-05-18 18:25:39 -0400, Tom Lane wrote: > Josh berkus <josh@agliodbs.com> writes: > > Maybe this is the wrong perspective. I mean, is there a reason we even > > need this option, other than a lack of any other way to do a full table > > scan to check for corruption, etc.? If we're only doing this for > > integrity checking, then maybe it's better if it becomes a function, > > which could be later extended with additional forensic features? > > Yes, I've been wondering that too. VACUUM is not meant as a corruption > checker, and should not be made into one, so what is the point of this > flag exactly? Well, so far a VACUUM FREEZE (or just setting vacuum_freeze_table_age = 0) verified the correctness of the visibility map; and that found a number of bugs. Now visibilitymap grew additional responsibilities, with a noticeable risk of data eating bugs, and there's no way to verify whether visibilitymap's frozen bits are set correctly. > (AFAIK, "select count(*) from table" would offer a similar amount of > sanity checking as a full-table VACUUM scan does, so it's not like > we've removed functionality with no near-term replacement.) I don't think that'd do anything comparable to /* * As of PostgreSQL 9.2, the visibility map bit should never be set if * the page-level bit is clear. However, it's possible that the bit * got cleared after we checked it and before we took the buffer * content lock, so we must recheck before jumping to the conclusion * that something bad has happened. */ else if (all_visible_according_to_vm && !PageIsAllVisible(page) && VM_ALL_VISIBLE(onerel, blkno, &vmbuffer)) { elog(WARNING, "page is not marked all-visible but visibility map bit is set in relation \"%s\" page %u", relname, blkno); visibilitymap_clear(onerel, blkno, vmbuffer); } If we had a checking module for all this it'd possibly be sufficient, but we don't. Greetings, Andres Freund
Andres Freund <andres@anarazel.de> writes: > On 2016-05-18 18:25:39 -0400, Tom Lane wrote: >> Yes, I've been wondering that too. VACUUM is not meant as a corruption >> checker, and should not be made into one, so what is the point of this >> flag exactly? > Well, so far a VACUUM FREEZE (or just setting vacuum_freeze_table_age = > 0) verified the correctness of the visibility map; and that found a > number of bugs. Now visibilitymap grew additional responsibilities, > with a noticeable risk of data eating bugs, and there's no way to verify > whether visibilitymap's frozen bits are set correctly. Meh. I'm not sure we should grow a rather half-baked feature we'll never be able to remove as a substitute for a separate sanity checker. The latter is really the right place for this kind of thing. regards, tom lane
On 2016-05-18 18:42:16 -0400, Tom Lane wrote: > Andres Freund <andres@anarazel.de> writes: > > On 2016-05-18 18:25:39 -0400, Tom Lane wrote: > >> Yes, I've been wondering that too. VACUUM is not meant as a corruption > >> checker, and should not be made into one, so what is the point of this > >> flag exactly? > > > Well, so far a VACUUM FREEZE (or just setting vacuum_freeze_table_age = > > 0) verified the correctness of the visibility map; and that found a > > number of bugs. Now visibilitymap grew additional responsibilities, > > with a noticeable risk of data eating bugs, and there's no way to verify > > whether visibilitymap's frozen bits are set correctly. > > Meh. I'm not sure we should grow a rather half-baked feature we'll never > be able to remove as a substitute for a separate sanity checker. The > latter is really the right place for this kind of thing. It's not a new feature, it's a feature we removed as a side effect. And one that allows us to evaluate whether the new feature actually works.
Andres Freund wrote: > > (AFAIK, "select count(*) from table" would offer a similar amount of > > sanity checking as a full-table VACUUM scan does, so it's not like > > we've removed functionality with no near-term replacement.) > > I don't think that'd do anything comparable to > /* > * As of PostgreSQL 9.2, the visibility map bit should never be set if > * the page-level bit is clear. However, it's possible that the bit > * got cleared after we checked it and before we took the buffer > * content lock, so we must recheck before jumping to the conclusion > * that something bad has happened. > */ > else if (all_visible_according_to_vm && !PageIsAllVisible(page) > && VM_ALL_VISIBLE(onerel, blkno, &vmbuffer)) > { > elog(WARNING, "page is not marked all-visible but visibility map bit is set in relation \"%s\" page %u", > relname, blkno); > visibilitymap_clear(onerel, blkno, vmbuffer); > } > > If we had a checking module for all this it'd possibly be sufficient, > but we don't. Here's an idea. We need core-blessed extensions (src/extensions/, you know I've proposed this before), so why not take this opportunity to create our first such and make it carry a function to scan a table completely to do this task. Since we were considering a new VACUUM option, surely this is serious enough to warrant more than just contrib. -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Wed, May 18, 2016 at 3:57 PM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote: > Since we were considering a new VACUUM option, surely this is serious > enough to warrant more than just contrib. I would like to see us consider the long-term best place for amcheck's functionality at the same time. Ideally, verification would be a somewhat generic operation, with AM-specific code invoked as appropriate. -- Peter Geoghegan
On Fri, May 06, 2016 at 04:42:48PM -0400, Robert Haas wrote: > On Thu, May 5, 2016 at 2:20 PM, Andres Freund <andres@anarazel.de> wrote: > > On 2016-05-02 14:48:18 -0700, Andres Freund wrote: > > + char new_vmbuf[BLCKSZ]; > > + char *new_cur = new_vmbuf; > > + bool empty = true; > > + bool old_lastpart; > > + > > + /* Copy page header in advance */ > > + memcpy(new_vmbuf, &pageheader, SizeOfPageHeaderData); > > > > Shouldn't we zero out new_vmbuf? Afaics we're not necessarily zeroing it > > with old_lastpart && !empty, right? > > Oh, dear. That seems like a possible data corruption bug. Maybe we'd > better fix that right away (although I don't actually have time before > the wrap). [This is a generic notification.] The above-described topic is currently a PostgreSQL 9.6 open item. Robert, since you committed the patch believed to have created it, you own this open item. If some other commit is more relevant or if this does not belong as a 9.6 open item, please let us know. Otherwise, please observe the policy on open item ownership[1] and send a status update within 72 hours of this message. Include a date for your subsequent status update. Testers may discover new open items at any time, and I want to plan to get them all fixed well in advance of shipping 9.6rc1. Consequently, I will appreciate your efforts toward speedy resolution. Thanks. [1] http://www.postgresql.org/message-id/20160527025039.GA447393@tornado.leadboat.com
On Sun, May 29, 2016 at 2:44 PM, Noah Misch <noah@leadboat.com> wrote: > On Fri, May 06, 2016 at 04:42:48PM -0400, Robert Haas wrote: >> On Thu, May 5, 2016 at 2:20 PM, Andres Freund <andres@anarazel.de> wrote: >> > On 2016-05-02 14:48:18 -0700, Andres Freund wrote: >> > + char new_vmbuf[BLCKSZ]; >> > + char *new_cur = new_vmbuf; >> > + bool empty = true; >> > + bool old_lastpart; >> > + >> > + /* Copy page header in advance */ >> > + memcpy(new_vmbuf, &pageheader, SizeOfPageHeaderData); >> > >> > Shouldn't we zero out new_vmbuf? Afaics we're not necessarily zeroing it >> > with old_lastpart && !empty, right? >> >> Oh, dear. That seems like a possible data corruption bug. Maybe we'd >> better fix that right away (although I don't actually have time before >> the wrap). > > [This is a generic notification.] > > The above-described topic is currently a PostgreSQL 9.6 open item. Robert, > since you committed the patch believed to have created it, you own this open > item. If some other commit is more relevant or if this does not belong as a > 9.6 open item, please let us know. Otherwise, please observe the policy on > open item ownership[1] and send a status update within 72 hours of this > message. Include a date for your subsequent status update. Testers may > discover new open items at any time, and I want to plan to get them all fixed > well in advance of shipping 9.6rc1. Consequently, I will appreciate your > efforts toward speedy resolution. Thanks. > > [1] http://www.postgresql.org/message-id/20160527025039.GA447393@tornado.leadboat.com Thank you for the notification. The check tool for the visibility map is still under discussion. I'm going to address the other review comments, and send the patch ASAP. Regards, -- Masahiko Sawada
On Wed, May 18, 2016 at 3:57 PM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote: > Andres Freund wrote: > >> >> If we had a checking module for all this it'd possibly be sufficient, >> but we don't. > > Here's an idea. We need core-blessed extensions (src/extensions/, you > know I've proposed this before), so why not take this opportunity to > create our first such and make it carry a function to scan a table > completely to do this task. > > Since we were considering a new VACUUM option, surely this is serious > enough to warrant more than just contrib. What does "core-blessed" mean? The commit rights for contrib/ are the same as they are for src/ Cheers, Jeff
On Tue, May 31, 2016 at 4:40 AM, Jeff Janes <jeff.janes@gmail.com> wrote: > On Wed, May 18, 2016 at 3:57 PM, Alvaro Herrera > <alvherre@2ndquadrant.com> wrote: >> Andres Freund wrote: >> >>> >>> If we had a checking module for all this it'd possibly be sufficient, >>> but we don't. >> >> Here's an idea. We need core-blessed extensions (src/extensions/, you >> know I've proposed this before), so why not take this opportunity to >> create our first such and make it carry a function to scan a table >> completely to do this task. >> >> Since we were considering a new VACUUM option, surely this is serious >> enough to warrant more than just contrib. > > What does "core-blessed" mean? The commit rights for contrib/ are the > same as they are for src/ Personally I understand contrib/ modules as third-party plugins that are considered not mature enough to be part of src/backend or src/bin, but that could one day become so. See pg_upgrade's recent move, for example. src/extensions/ would include third-party plugins that are thought to be useful and are part of the main server package, but are not something that we want to enable by default. -- Michael
On Sun, May 29, 2016 at 1:44 AM, Noah Misch <noah@leadboat.com> wrote: > On Fri, May 06, 2016 at 04:42:48PM -0400, Robert Haas wrote: >> On Thu, May 5, 2016 at 2:20 PM, Andres Freund <andres@anarazel.de> wrote: >> > On 2016-05-02 14:48:18 -0700, Andres Freund wrote: >> > + char new_vmbuf[BLCKSZ]; >> > + char *new_cur = new_vmbuf; >> > + bool empty = true; >> > + bool old_lastpart; >> > + >> > + /* Copy page header in advance */ >> > + memcpy(new_vmbuf, &pageheader, SizeOfPageHeaderData); >> > >> > Shouldn't we zero out new_vmbuf? Afaics we're not necessarily zeroing it >> > with old_lastpart && !empty, right? >> >> Oh, dear. That seems like a possible data corruption bug. Maybe we'd >> better fix that right away (although I don't actually have time before >> the wrap). > > [This is a generic notification.] > > The above-described topic is currently a PostgreSQL 9.6 open item. Robert, > since you committed the patch believed to have created it, you own this open > item. If some other commit is more relevant or if this does not belong as a > 9.6 open item, please let us know. Otherwise, please observe the policy on > open item ownership[1] and send a status update within 72 hours of this > message. Include a date for your subsequent status update. Testers may > discover new open items at any time, and I want to plan to get them all fixed > well in advance of shipping 9.6rc1. Consequently, I will appreciate your > efforts toward speedy resolution. Thanks. I am going to try to find time to look at this later this week, but realistically it's going to be a little bit difficult to find that time. I was away over Memorial Day weekend and was in meetings most of today. I have a huge pile of email to catch up on. I will send another status update no later than Friday. If Andres or anyone else wants to jump in and fix this up meanwhile, that would be great. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Sat, May 7, 2016 at 5:34 AM, Robert Haas <robertmhaas@gmail.com> wrote: > On Mon, May 2, 2016 at 8:25 PM, Andres Freund <andres@anarazel.de> wrote: >> + * heap_tuple_needs_eventual_freeze >> + * >> + * Check to see whether any of the XID fields of a tuple (xmin, xmax, xvac) >> + * will eventually require freezing. Similar to heap_tuple_needs_freeze, >> + * but there's no cutoff, since we're trying to figure out whether freezing >> + * will ever be needed, not whether it's needed now. >> + */ >> +bool >> +heap_tuple_needs_eventual_freeze(HeapTupleHeader tuple) >> >> Wouldn't redefining this to heap_tuple_is_frozen() and then inverting the >> checks be easier to understand? > > I thought it much safer to keep this as close to a copy of > heap_tuple_needs_freeze() as possible. Copying a function and > inverting all of the return values is much more likely to introduce > bugs, IME. I agree. >> + /* >> + * If xmax is a valid xact or multixact, this tuple is also not frozen. >> + */ >> + if (tuple->t_infomask & HEAP_XMAX_IS_MULTI) >> + { >> + MultiXactId multi; >> + >> + multi = HeapTupleHeaderGetRawXmax(tuple); >> + if (MultiXactIdIsValid(multi)) >> + return true; >> + } >> >> Hm. What's the test inside the if() for? There shouldn't be any case >> where xmax is invalid if HEAP_XMAX_IS_MULTI is set. Now there's a >> check like that outside of this commit, but it seems strange to me >> (Alvaro, perhaps you could comment on this?). > > Here again I was copying existing code, with appropriate simplifications. > >> + * >> + * Clearing both visibility map bits is not separately WAL-logged. The callers >> * must make sure that whenever a bit is cleared, the bit is cleared on WAL >> * replay of the updating operation as well. >> >> I think including "both" here makes things less clear, because it >> differentiates clearing one bit from clearing both. There's no practical >> differentce atm, but still. > > I agree. Fixed. >> * >> * VACUUM will normally skip pages for which the visibility map bit is set; >> * such pages can't contain any dead tuples and therefore don't need vacuuming. >> - * The visibility map is not used for anti-wraparound vacuums, because >> - * an anti-wraparound vacuum needs to freeze tuples and observe the latest xid >> - * present in the table, even on pages that don't have any dead tuples. >> * >> >> I think the remaining sentence isn't entirely accurate, there's now more >> than one bit, and they're different with regard to scan_all/!scan_all >> vacuums (or will be - maybe this updated further in a later commit? But >> if so, that sentence shouldn't yet be removed...). > > We can adjust the language, but I don't really see a big problem here. This comment is not incorporated in this patch so far. >> -/* Number of heap blocks we can represent in one byte. */ >> -#define HEAPBLOCKS_PER_BYTE 8 >> - >> Hm, why was this moved to the header? Sounds like something the outside >> shouldn't care about. > > Oh... yeah. Let's undo that. Fixed. >> #define HEAPBLK_TO_MAPBIT(x) (((x) % HEAPBLOCKS_PER_BYTE) * BITS_PER_HEAPBLOCK) >> >> Hm. This isn't really a mapping to an individual bit anymore - but I >> don't really have a better name in mind. Maybe TO_OFFSET? > > Well, it sorta is... but we could change it, I suppose. > >> +static const uint8 number_of_ones_for_visible[256] = { >> ... >> +}; >> +static const uint8 number_of_ones_for_frozen[256] = { >> ... >> }; >> >> Did somebody verify the new contents are correct? > > I admit that I didn't.
It seemed like an unlikely place for a goof, > but I guess we should verify. >> /* >> - * visibilitymap_clear - clear a bit in visibility map >> + * visibilitymap_clear - clear all bits in visibility map >> * >> >> This seems rather easy to misunderstand, as this really only clears all >> the bits for one page, not actually all the bits. > > We could change "in" to "for one page in the". Fixed. >> * the bit for heapBlk, or InvalidBuffer. The caller is responsible for >> - * releasing *buf after it's done testing and setting bits. >> + * releasing *buf after it's done testing and setting bits, and must pass flags >> + * for which it needs to check the value in visibility map. >> * >> * NOTE: This function is typically called without a lock on the heap page, >> * so somebody else could change the bit just after we look at it. In fact, >> @@ -327,17 +351,16 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf, >> >> I'm not seeing what flags the above comment change is referring to? > > Ugh. I think that's leftover cruft from an earlier patch version that > should have been excised from what got committed. Fixed. >> /* >> - * A single-bit read is atomic. There could be memory-ordering effects >> + * A single byte read is atomic. There could be memory-ordering effects >> * here, but for performance reasons we make it the caller's job to worry >> * about that. >> */ >> - result = (map[mapByte] & (1 << mapBit)) ? true : false; >> - >> - return result; >> + return ((map[mapByte] >> mapBit) & VISIBILITYMAP_VALID_BITS); >> } >> >> Not a new issue, and *very* likely to be irrelevant in practice (given >> the value is only referenced once): But there's really no guarantee >> map[mapByte] is only read once here. > > Meh. But we can fix if you want to. Fixed. >> -BlockNumber >> -visibilitymap_count(Relation rel) >> +void >> +visibilitymap_count(Relation rel, BlockNumber *all_visible, BlockNumber *all_frozen) >> >> Not really a new issue again: The parameter types (previously return >> type) to this function seem wrong to me. > > Not this patch's job to tinker. This comment is not incorporated in this patch yet. >> @@ -1934,5 +1992,14 @@ heap_page_is_all_visible(Relation rel, Buffer buf, TransactionId *visibility_cut >> } >> + /* >> + * We don't bother clearing *all_frozen when the page is discovered not >> + * to be all-visible, so do that now if necessary. The page might fail >> + * to be all-frozen for other reasons anyway, but if it's not all-visible, >> + * then it definitely isn't all-frozen. >> + */ >> + if (!all_visible) >> + *all_frozen = false; >> + >> >> Why don't we just set *all_frozen to false when appropriate? It'd be >> just as many lines and probably easier to understand? > > I thought that looked really easy to mess up, either now or down the > road. This way seemed more solid to me. That's a judgement call, of > course. To make it easier to understand, I changed it so. >> + /* >> + * If the page is marked as all-visible but not all-frozen, we should >> + * so mark it. Note that all_frozen is only valid if all_visible is >> + * true, so we must check both. >> + */ >> >> This kinda seems to imply that all-visible implies all_frozen. Also, why >> has that block been added to the end of the if/else if chain? Seems like >> it belongs below the (all_visible && !all_visible_according_to_vm) block. > > We can adjust the comment a bit to make it more clear, if you like, > but I doubt it's going to cause serious misunderstanding.
As for the > placement, the reason I put it at the end is because I figured that we > did not want to mark it all-frozen if any of the "oh crap, emit a > warning" cases applied. > Fixed comment. I think that we should care about the all-visible problem first, and then the all-frozen problem. So this patch doesn't change the placement. The attached patch fixes only the above comments; the others are being addressed now. -- Regards, -- Masahiko Sawada
On Sat, May 7, 2016 at 5:40 AM, Robert Haas <robertmhaas@gmail.com> wrote: > On Wed, May 4, 2016 at 8:08 PM, Andres Freund <andres@anarazel.de> wrote: >> On 2016-05-02 14:48:18 -0700, Andres Freund wrote: >>> 77a1d1e Department of second thoughts: remove PD_ALL_FROZEN. >> >> Nothing to say here. >> >>> fd31cd2 Don't vacuum all-frozen pages. >> >> Hm. I do wonder if it's going to bite us that we don't have a way to >> actually force vacuuming of the whole table (besides manually rm'ing the >> VM). I've more than once seen VACUUM used to try to do some integrity >> checking of the database. How are we actually going to test that the >> feature works correctly? They'd have to write checks ontop of >> pg_visibility to see whether things are borked. > > Let's add VACUUM (FORCE) or something like that. > >> /* >> * Compute whether we actually scanned the whole relation. If we did, we >> * can adjust relfrozenxid and relminmxid. >> * >> * NB: We need to check this before truncating the relation, because that >> * will change ->rel_pages. >> */ >> >> Comment is out-of-date now. > > OK. Fixed. >> - if (blkno == next_not_all_visible_block) >> + if (blkno == next_unskippable_block) >> { >> - /* Time to advance next_not_all_visible_block */ >> - for (next_not_all_visible_block++; >> - next_not_all_visible_block < nblocks; >> - next_not_all_visible_block++) >> + /* Time to advance next_unskippable_block */ >> + for (next_unskippable_block++; >> + next_unskippable_block < nblocks; >> + next_unskippable_block++) >> >> Hm. So we continue with the course of re-processing pages, even if >> they're marked all-frozen. For all-visible there at least can be a >> benefit by freezing earlier, but for all-frozen pages there's really no >> point. I don't really buy the arguments for the skipping logic. But >> even disregarding that, maybe we should skip processing a block if it's >> all-frozen (without preventing the page from being read?); as there's no >> possible benefit? Acquring the exclusive/content lock and stuff is far >> from free. > > I wanted to tinker with this logic as little as possible in the > interest of ending up with something that worked. I would not have > written it this way. > >> Not really related to this patch, but the FORCE_CHECK_PAGE is rather >> ugly. > > +1. >> + /* >> + * The current block is potentially skippable; if we've seen a >> + * long enough run of skippable blocks to justify skipping it, and >> + * we're not forced to check it, then go ahead and skip. >> + * Otherwise, the page must be at least all-visible if not >> + * all-frozen, so we can set all_visible_according_to_vm = true. >> + */ >> + if (skipping_blocks && !FORCE_CHECK_PAGE()) >> + { >> + /* >> + * Tricky, tricky. If this is in aggressive vacuum, the page >> + * must have been all-frozen at the time we checked whether it >> + * was skippable, but it might not be any more. We must be >> + * careful to count it as a skipped all-frozen page in that >> + * case, or else we'll think we can't update relfrozenxid and >> + * relminmxid. If it's not an aggressive vacuum, we don't >> + * know whether it was all-frozen, so we have to recheck; but >> + * in this case an approximate answer is OK. >> + */ >> + if (aggressive || VM_ALL_FROZEN(onerel, blkno, &vmbuffer)) >> + vacrelstats->frozenskipped_pages++; >> continue; >> + } >> >> Hm. This indeed seems a bit tricky. Not sure how to make it easier >> though without just ripping out the SKIP_PAGES_THRESHOLD stuff. > > Yep, I had the same problem. >> Hm. 
This also doubles the number of VM accesses. While I guess that's >> not noticeable most of the time, it's still not nice; especially when a >> large relation is entirely frozen, because it'll mean we'll sequentially >> go through the visibilitymap twice. > > Compared to what we're saving, that's obviously a trivial cost. > That's not to say that we might not want to improve it, but it's > hardly a disaster. > > In short: wah, wah, wah. > Attached patch optimises skipping pages logic so that blkno can jump to next_unskippable_block directly while counting the number of all_visible and all_frozen pages. So we can avoid double checking visibility map. Regards, -- Masahiko Sawada
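To sketch the shape of that change (a sketch only, under assumed names, not the attached patch itself): the lookahead that establishes next_unskippable_block can tally the all-frozen pages in the run as it scans the map, so the main loop can then consume the whole run in one jump:

    /*
     * frozen_in_run is a hypothetical counter accumulated while scanning
     * ahead for next_unskippable_block, so the map is read only once.
     */
    if (skipping_blocks && blkno < next_unskippable_block)
    {
        vacrelstats->frozenskipped_pages += frozen_in_run;
        blkno = next_unskippable_block;     /* jump over the whole run */
        continue;
    }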
On Wed, Jun 1, 2016 at 3:50 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote: > The attached patch fixes only the above comments; the others are being addressed now. Committed. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Thu, Jun 2, 2016 at 11:24 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote: > Attached patch optimises skipping pages logic so that blkno can jump to > next_unskippable_block directly while counting the number of all_visible > and all_frozen pages. So we can avoid double checking visibility map. I think this is 9.7 material. This patch has already won the "scariest patch" tournament. Changing the logic more than necessary at this late date seems like it just increases the scariness. I think this is an opportunity for further optimization, not a defect. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Fri, Jun 3, 2016 at 11:03 PM, Robert Haas <robertmhaas@gmail.com> wrote: > On Thu, Jun 2, 2016 at 11:24 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote: >> Attached patch optimises skipping pages logic so that blkno can jump to >> next_unskippable_block directly while counting the number of all_visible >> and all_frozen pages. So we can avoid double checking visibility map. > > I think this is 9.7 material. This patch has already won the > "scariest patch" tournament. Changing the logic more than necessary > at this late date seems like it just increases the scariness. I think > this is an opportunity for further optimization, not a defect. > I agree with you. I'll submit this as an improvement for 9.7. That patch also incorporates the following review comment. We can push at least this fix. >> /* >> * Compute whether we actually scanned the whole relation. If we did, we >> * can adjust relfrozenxid and relminmxid. >> * >> * NB: We need to check this before truncating the relation, because that >> * will change ->rel_pages. >> */ >> >> Comment is out-of-date now. I'm addressing the review comments on commit 7087166, and will post the patch. And the testing feature for the freeze map is still under discussion. Regards, -- Masahiko Sawada
On Fri, Jun 3, 2016 at 10:49 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote: > That patch also incorporates the following review comment. > We can push at least this fix. Can you submit that part as a separate patch? > I'm addressing the review comments on commit 7087166, and will post the patch. When? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Sat, Jun 4, 2016 at 12:08 AM, Robert Haas <robertmhaas@gmail.com> wrote: > On Fri, Jun 3, 2016 at 10:49 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote: >> That patch also incorporates the following review comment. >> We can push at least this fix. > > Can you submit that part as a separate patch? Attached. >> I'm addressing the review comments on commit 7087166, and will post the patch. > > When? > On Saturday. Regards, -- Masahiko Sawada
On Fri, Jun 3, 2016 at 11:21 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote: >> Can you submit that part as a separate patch? > > Attached. Thanks, committed. >>> I'm addressing the review comments on commit 7087166, and will post the patch. >> >> When? > > On Saturday. Great. Will that address everything for this open item, then? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Sat, May 7, 2016 at 5:42 AM, Robert Haas <robertmhaas@gmail.com> wrote: > On Thu, May 5, 2016 at 2:20 PM, Andres Freund <andres@anarazel.de> wrote: >> On 2016-05-02 14:48:18 -0700, Andres Freund wrote: >>> 7087166 pg_upgrade: Convert old visibility map format to new format. >> >> +const char * >> +rewriteVisibilityMap(const char *fromfile, const char *tofile, bool force) >> ... >> >> + while ((bytesRead = read(src_fd, buffer, BLCKSZ)) == BLCKSZ) >> + { >> .. >> >> Uh, shouldn't we actually fail if we read incompletely? Rather than >> silently ignoring the problem? Ok, this causes no corruption, but it >> indicates that something went significantly wrong. > > Sure, that's reasonable. > Fixed. >> + char new_vmbuf[BLCKSZ]; >> + char *new_cur = new_vmbuf; >> + bool empty = true; >> + bool old_lastpart; >> + >> + /* Copy page header in advance */ >> + memcpy(new_vmbuf, &pageheader, SizeOfPageHeaderData); >> >> Shouldn't we zero out new_vmbuf? Afaics we're not necessarily zeroing it >> with old_lastpart && !empty, right? > > Oh, dear. That seems like a possible data corruption bug. Maybe we'd > better fix that right away (although I don't actually have time before > the wrap). Since the force is always set true, I removed the force from argument of copyFile() and rewriteVisibilityMap(). And destination file is always opened with O_RDWR, O_CREAT, O_TRUNC flags . >> + if ((dst_fd = open(tofile, O_RDWR | O_CREAT | (force ? 0 : O_EXCL), S_IRUSR | S_IWUSR)) < 0) >> + { >> + close(src_fd); >> + return getErrorText(); >> + } >> >> I know you guys copied this, but what's the force thing about? >> Especially as it's always set to true by the callers (i.e. what is the >> parameter even about?)? Wouldn't we at least have to specify O_TRUNC in >> the force case? > > I just work here. > >> + old_cur += BITS_PER_HEAPBLOCK_OLD; >> + new_cur += BITS_PER_HEAPBLOCK; >> >> I'm not sure I'm understanding the point of the BITS_PER_HEAPBLOCK_OLD >> stuff - as long as it's hardcoded into rewriteVisibilityMap() we'll not >> be able to have differing ones anyway, should we decide to add a third >> bit? > > I think that's just a matter of style. So this comment is not incorporated. Attached patch, please review it. Regards, -- Masahiko Sawada
On Fri, Jun 3, 2016 at 10:25 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote: >>> + char new_vmbuf[BLCKSZ]; >>> + char *new_cur = new_vmbuf; >>> + bool empty = true; >>> + bool old_lastpart; >>> + >>> + /* Copy page header in advance */ >>> + memcpy(new_vmbuf, &pageheader, SizeOfPageHeaderData); >>> >>> Shouldn't we zero out new_vmbuf? Afaics we're not necessarily zeroing it >>> with old_lastpart && !empty, right? >> >> Oh, dear. That seems like a possible data corruption bug. Maybe we'd >> better fix that right away (although I don't actually have time before >> the wrap). Actually, on second thought, I'm not seeing the bug here. It seems to me that the loop commented this way: /* Process old page bytes one by one, and turn it into new page. */ ...should always write to every byte in new_vmbuf, because we process exactly half the bytes in the old block at a time, and so that's going to generate exactly one full page of new bytes. Am I missing something? > Since the force is always set true, I removed the force from argument > of copyFile() and rewriteVisibilityMap(). > And destination file is always opened with O_RDWR, O_CREAT, O_TRUNC flags . I'm not happy with this. I think we should always open with O_EXCL, because the new file is not expected to exist and if it does, something's probably broken. I think we should default to the safe behavior (which is failing) rather than the unsafe behavior (which is clobbering data). (Status update for Noah: I expect Masahiko Sawada will respond quickly, but if not I'll give some kind of update by Monday COB anyhow.) -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
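Looking at the conversion arithmetic in isolation may make that easier to see. A standalone sketch (not the pg_upgrade code itself): each old-format byte carries one bit apiece for eight heap blocks and expands to exactly two new-format bytes of four two-bit entries each, so half an old page of data always yields one full new page, and every byte of the output gets assigned:

    #include <stdio.h>

    #define ALL_VISIBLE 0x01        /* low bit of each 2-bit entry */

    /* Expand one old-format VM byte into two new-format bytes. */
    static void
    expand_byte(unsigned char old, unsigned char out[2])
    {
        int         i;

        out[0] = out[1] = 0;
        for (i = 0; i < 8; i++)
        {
            if (old & (1 << i))
                out[i / 4] |= ALL_VISIBLE << ((i % 4) * 2);
            /* the conversion never sets the all-frozen bit */
        }
    }

    int
    main(void)
    {
        unsigned char out[2];

        expand_byte(0xb1, out);     /* arbitrary example input */
        printf("0xb1 -> 0x%02x 0x%02x\n", out[0], out[1]);
        return 0;
    }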
On Sat, Jun 4, 2016 at 12:41 PM, Robert Haas <robertmhaas@gmail.com> wrote: > On Fri, Jun 3, 2016 at 10:25 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote: >>>> + char new_vmbuf[BLCKSZ]; >>>> + char *new_cur = new_vmbuf; >>>> + bool empty = true; >>>> + bool old_lastpart; >>>> + >>>> + /* Copy page header in advance */ >>>> + memcpy(new_vmbuf, &pageheader, SizeOfPageHeaderData); >>>> >>>> Shouldn't we zero out new_vmbuf? Afaics we're not necessarily zeroing it >>>> with old_lastpart && !empty, right? >>> >>> Oh, dear. That seems like a possible data corruption bug. Maybe we'd >>> better fix that right away (although I don't actually have time before >>> the wrap). > > Actually, on second thought, I'm not seeing the bug here. It seems to > me that the loop commented this way: > > /* Process old page bytes one by one, and turn it into new page. */ > > ...should always write to every byte in new_vmbuf, because we process > exactly half the bytes in the old block at a time, and so that's going > to generate exactly one full page of new bytes. Am I missing > something? Yeah, you're right. rewriteVisibilityMap() always writes exactly the whole of new_vmbuf. > >> Since force is always set to true, I removed the force argument from >> copyFile() and rewriteVisibilityMap(), and the destination file is now >> always opened with the O_RDWR, O_CREAT, and O_TRUNC flags. > > I'm not happy with this. I think we should always open with O_EXCL, > because the new file is not expected to exist and if it does, > something's probably broken. I think we should default to the safe > behavior (which is failing) rather than the unsafe behavior (which is > clobbering data). I've specified O_EXCL instead of O_TRUNC. Attached is the updated patch. Regards, -- Masahiko Sawada
Attachment
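A sketch of the safer open call agreed on here (assumed flag set, not a quote of the final patch): with O_EXCL, creation fails if the destination already exists, so a leftover file aborts the conversion instead of being clobbered.

    #include <fcntl.h>
    #include <sys/stat.h>

    /* Create tofile; fail with EEXIST rather than truncating leftovers. */
    static int
    open_new_vm_file(const char *tofile)
    {
        return open(tofile, O_RDWR | O_CREAT | O_EXCL, S_IRUSR | S_IWUSR);
    }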
On Sat, Jun 4, 2016 at 12:59 AM, Robert Haas <robertmhaas@gmail.com> wrote: > On Fri, Jun 3, 2016 at 11:21 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote: >>> Can you submit that part as a separate patch? >> >> Attached. > > Thanks, committed. > >>>> I'm addressing the review comment of commit 7087166, and will post the patch. >>> >>> When? >> >> On Saturday. > > Great. Will that address everything for this open item, then? > I attached the patch for commit 7087166 in another mail. I think that only the test tool for the visibility map remains, and it is under discussion. Even if we have a verification tool or function for the visibility map, we cannot repair its contents if they turn out to be wrong. So I think we should have a way to re-generate the visibility map. For this purpose, doing vacuum while ignoring the visibility map via a new option or new function is one idea. But IMHO, it's not a good idea to allow a function to do vacuum, and expanding the VACUUM syntax might be somewhat overkill. So another idea is to have a GUC parameter, for example vacuum_even_frozen_page. If this parameter is set to true (false by default), we vacuum the whole table forcibly and re-generate the visibility map. The advantage of this idea is that we don't need to expand the VACUUM syntax and can relatively easily remove this parameter if it's no longer necessary. Thoughts? Regards, -- Masahiko Sawada
On Sat, Jun 4, 2016 at 1:46 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote: > On Sat, Jun 4, 2016 at 12:59 AM, Robert Haas <robertmhaas@gmail.com> wrote: >> On Fri, Jun 3, 2016 at 11:21 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote: >>>> Can you submit that part as a separate patch? >>> >>> Attached. >> >> Thanks, committed. >> >>>>> I'm addressing the review comment of commit 7087166, and will post the patch. >>>> >>>> When? >>> >>> On Saturday. >> >> Great. Will that address everything for this open item, then? >> > > I attached the patch for commit 7087166 in another mail. > I think that only the test tool for the visibility map remains, and it is > under discussion. > Even if we have a verification tool or function for the visibility map, we > cannot repair its contents if they turn out to be wrong. > So I think we should have a way to re-generate the visibility map. > For this purpose, doing vacuum while ignoring the visibility map via a new > option or new function is one idea. > But IMHO, it's not a good idea to allow a function to do vacuum, and > expanding the VACUUM syntax might be somewhat overkill. > > So another idea is to have a GUC parameter, for example vacuum_even_frozen_page. > If this parameter is set to true (false by default), we vacuum the whole > table forcibly and re-generate the visibility map. > The advantage of this idea is that we don't need to expand the VACUUM > syntax and can relatively easily remove this parameter if it's no longer > necessary. > Attached is a sample patch that controls whole-table vacuum via a new GUC parameter. Regards, -- Masahiko Sawada
Attachment
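As a rough illustration of the proposal (hypothetical names throughout; this GUC was never committed), the page-skip decision in the heap scan might consult the parameter like this:

    #include <stdbool.h>

    /* Hypothetical GUC proposed above, false by default. */
    bool        vacuum_even_frozen_page = false;

    /*
     * Sketch of the skip test: an aggressive vacuum may normally skip a
     * page whose all-frozen bit is set in the VM; with the GUC enabled,
     * every page is visited, re-deriving its visibility map bits.
     */
    static bool
    skip_all_frozen_page(bool all_frozen_according_to_vm)
    {
        return all_frozen_according_to_vm && !vacuum_even_frozen_page;
    }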
On Mon, Jun 6, 2016 at 5:44 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote: > On Sat, Jun 4, 2016 at 1:46 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote: >> On Sat, Jun 4, 2016 at 12:59 AM, Robert Haas <robertmhaas@gmail.com> wrote: >>> On Fri, Jun 3, 2016 at 11:21 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote: >>>>> Can you submit that part as a separate patch? >>>> >>>> Attached. >>> >>> Thanks, committed. >>> >>>>>> I'm addressing the review comment of commit 7087166, and will post the patch. >>>>> >>>>> When? >>>> >>>> On Saturday. >>> >>> Great. Will that address everything for this open item, then? >>> >> >> I attached the patch for commit 7087166 in another mail. >> I think that only the test tool for the visibility map remains, and it is >> under discussion. >> Even if we have a verification tool or function for the visibility map, we >> cannot repair its contents if they turn out to be wrong. >> So I think we should have a way to re-generate the visibility map. >> For this purpose, doing vacuum while ignoring the visibility map via a new >> option or new function is one idea. >> But IMHO, it's not a good idea to allow a function to do vacuum, and >> expanding the VACUUM syntax might be somewhat overkill. >> >> So another idea is to have a GUC parameter, for example vacuum_even_frozen_page. >> If this parameter is set to true (false by default), we vacuum the whole >> table forcibly and re-generate the visibility map. >> The advantage of this idea is that we don't need to expand the VACUUM >> syntax and can relatively easily remove this parameter if it's no longer >> necessary. >> > > Attached is a sample patch that controls whole-table vacuum via a new GUC parameter. Don't we want a reloption for that? Just wondering... -- Michael
On Mon, Jun 6, 2016 at 5:11 AM, Michael Paquier <michael.paquier@gmail.com> wrote: >> Attached is a sample patch that controls whole-table vacuum via a new GUC parameter. > > Don't we want a reloption for that? Just wondering... Why? Just for consistency? I think the bigger question here is whether we need to do anything at all. It's true that, without some new option, we'll lose the ability to forcibly vacuum every page in the relation, even if all-frozen. But there's not much use case for that in the first place. It will be potentially helpful if it turns out that we have a bug that sets the all-frozen bit on pages that are not, in fact, all-frozen. Otherwise, what's the use? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Mon, Jun 6, 2016 at 6:34 PM, Robert Haas <robertmhaas@gmail.com> wrote: > On Mon, Jun 6, 2016 at 5:11 AM, Michael Paquier > <michael.paquier@gmail.com> wrote: >>> Attached is a sample patch that controls whole-table vacuum via a new GUC parameter. >> >> Don't we want a reloption for that? Just wondering... > > Why? Just for consistency? I think the bigger question here is > whether we need to do anything at all. It's true that, without some > new option, we'll lose the ability to forcibly vacuum every page in > the relation, even if all-frozen. But there's not much use case for > that in the first place. It will be potentially helpful if it turns > out that we have a bug that sets the all-frozen bit on pages that are > not, in fact, all-frozen. Otherwise, what's the use? > I cannot agree with using this parameter as a reloption. We would set it to true only when a serious bug is discovered and we want to re-generate the visibility maps of specific tables. I thought that control via a GUC parameter would be more convenient than adding a new option. Regards, -- Masahiko Sawada
On Sat, Jun 4, 2016 at 12:18 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote: > Attached updated patch. The error-checking enhancements here look good to me, except that you forgot to initialize totalBytesRead. I've committed those changes with a fix for that problem and will look at the rest of this separately. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Masahiko Sawada <sawada.mshk@gmail.com> writes: > On Sat, Jun 4, 2016 at 1:46 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote: >> So another idea is to have a GUC parameter, for example vacuum_even_frozen_page. >> If this parameter is set to true (false by default), we vacuum the whole >> table forcibly and re-generate the visibility map. >> The advantage of this idea is that we don't need to expand the VACUUM >> syntax and can relatively easily remove this parameter if it's no longer >> necessary. > Attached is a sample patch that controls whole-table vacuum via a new GUC parameter. I find this approach fairly ugly ... it's randomly inconsistent with other VACUUM parameters for no very defensible reason. Taking out GUCs is not easier than taking out statement parameters; you risk breaking applications either way. regards, tom lane
On Mon, Jun 6, 2016 at 7:46 AM, Robert Haas <robertmhaas@gmail.com> wrote: > On Sat, Jun 4, 2016 at 12:18 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote: >> Attached updated patch. > > The error-checking enhancements here look good to me, except that you > forgot to initialize totalBytesRead. I've committed those changes > with a fix for that problem and will look at the rest of this > separately. Committed that now, too. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Mon, Jun 6, 2016 at 9:53 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Masahiko Sawada <sawada.mshk@gmail.com> writes: >> On Sat, Jun 4, 2016 at 1:46 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote: >>> So another idea is to have a GUC parameter, for example vacuum_even_frozen_page. >>> If this parameter is set to true (false by default), we vacuum the whole >>> table forcibly and re-generate the visibility map. >>> The advantage of this idea is that we don't need to expand the VACUUM >>> syntax and can relatively easily remove this parameter if it's no longer >>> necessary. > >> Attached is a sample patch that controls whole-table vacuum via a new GUC parameter. > > I find this approach fairly ugly ... it's randomly inconsistent with other > VACUUM parameters for no very defensible reason. Just to be sure I understand, in what way is it inconsistent? > Taking out GUCs is not > easier than taking out statement parameters; you risk breaking > applications either way. Agreed, but that doesn't really answer the question of which one we should have, if either. My gut feeling on this is to either do nothing or add a VACUUM option (not a GUC, not a reloption) called even_frozen_pages, default false. What is your opinion? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes: > On Mon, Jun 6, 2016 at 9:53 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> Taking out GUCs is not >> easier than taking out statement parameters; you risk breaking >> applications either way. > Agreed, but that doesn't really answer the question of which one we > should have, if either. My gut feeling on this is to either do > nothing or add a VACUUM option (not a GUC, not a reloption) called > even_frozen_pages, default false. What is your opinion? That's about where I stand, with some preference for "do nothing". I'm not convinced we need this. regards, tom lane
On Fri, Jun 3, 2016 at 11:41 PM, Robert Haas <robertmhaas@gmail.com> wrote: > (Status update for Noah: I expect Masahiko Sawada will respond > quickly, but if not I'll give some kind of update by Monday COB > anyhow.) I believe this open item is now closed, unless Andres has more comments or wishes to discuss any point further, with the exception that we still need to decide whether to add VACUUM (even_frozen_pages) or some variant of that. I have added a new open item for that issue and marked this one as resolved. My intended strategy as the presumptive owner of the new items is to do nothing unless more of a consensus emerges than we have presently. We do not seem to have clear agreement on whether to add the new option; whether to make it a GUC, a reloption, a VACUUM syntax option, or some combination of those things; and whether it should blow up the existing VM and rebuild it (as proposed by Sawada-san) or just force frozen pages to be scanned in the hope that something good will happen (as proposed by Andres). In the absence of consensus, doing nothing is a reasonable choice here. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Robert Haas wrote: > My gut feeling on this is to either do nothing or add a VACUUM option > (not a GUC, not a reloption) called even_frozen_pages, default false. > What is your opinion? +1 for that approach -- I thought that was already agreed weeks ago and the only question was what to name that option. even_frozen_pages sounds better than SCANALL, SCAN_ALL, FREEZE, FORCE (the other options I saw proposed in that subthread), so +1 for that naming too. I don't like doing nothing; that means that when we discover a bug we'll have to tell users to rm a file whose name requires a complicated catalog query to find out, so -1 for that. -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Mon, Jun 6, 2016 at 10:18 AM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote: > Robert Haas wrote: >> My gut feeling on this is to either do nothing or add a VACUUM option >> (not a GUC, not a reloption) called even_frozen_pages, default false. >> What is your opinion? > > +1 for that approach -- I thought that was already agreed weeks ago and > the only question was what to name that option. even_frozen_pages > sounds better than SCANALL, SCAN_ALL, FREEZE, FORCE (the other > options I saw proposed in that subthread), so +1 for that naming > too. > > I don't like doing nothing; that means that when we discover a bug we'll > have to tell users to rm a file whose name requires a complicated > catalog query to find out, so -1 for that. So... I agree that it is definitely not good if we have to tell users to rm a file, but I am not quite sure how this new option would prevent us from having to say that? Here are some potential kinds of bugs we might have:

1. Sometimes, the all-frozen bit doesn't get set when it should.

2. Sometimes, the all-frozen bit gets set when it shouldn't.

3. Some combination of (1) and (2), so that the VM fork can't be trusted in either direction.

If (1) happens, removing the VM fork is not a good idea; what people will want to do is re-run a VACUUM FREEZE. If (2) or (3) happens, removing the VM fork might be a good idea, but it's not really clear that VACUUM (even_frozen_pages) will help much. For one thing, if there are actually unfrozen tuples on those pages and the clog pages which they reference are already gone or recycled, rerunning VACUUM on the table in any form might permanently lose data, or maybe it will just fail. If because of the nature of the bug you somehow know that case doesn't pertain, then I suppose the bug is that the tuple-level and page-level state is out of sync. VACUUM (even_frozen_pages) probably won't help with that much either, because VACUUM never clears the all-frozen bit without also clearing the all-visible bit, and then only if the page contains dead tuples, which in this case it probably doesn't. I'm intuitively sympathetic to the idea that we should have an option for this, but I can't figure out in what case we'd actually tell anyone to use it. It would be useful for the kinds of bugs listed above to have VACUUM (rebuild_vm) to blow away the VM fork and rebuild it, but that's different semantics than what we proposed for VACUUM (even_frozen_pages). And I'd be sort of inclined to handle that case by providing some other way to remove VM forks (like a new function in the pg_visibility contrib module, maybe?) and then just tell people to run regular VACUUM afterwards, rather than putting the actual VM fork removal into VACUUM. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes: > I'm intuitively sympathetic to the idea that we should have an option > for this, but I can't figure out in what case we'd actually tell > anyone to use it. It would be useful for the kinds of bugs listed > above to have VACUUM (rebuild_vm) to blow away the VM fork and rebuild > it, but that's different semantics than what we proposed for VACUUM > (even_frozen_pages). And I'd be sort of inclined to handle that case > by providing some other way to remove VM forks (like a new function in > the pg_visibility contrib module, maybe?) and then just tell people > to run regular VACUUM afterwards, rather than putting the actual VM > fork removal into VACUUM. There's a lot to be said for that approach. If we do it, I'd be a bit inclined to offer an option to blow away the FSM as well. regards, tom lane
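To make the shape of that idea concrete, here is a hedged sketch of such a contrib helper (hypothetical function name; it assumes the 9.6-era visibilitymap_truncate() API and glosses over permission checks and error handling). It throws away the relation's VM fork contents so that a subsequent plain VACUUM has to rebuild them from scratch:

    #include "postgres.h"

    #include "access/heapam.h"
    #include "access/visibilitymap.h"
    #include "fmgr.h"

    PG_MODULE_MAGIC;

    PG_FUNCTION_INFO_V1(pg_truncate_visibility_map);

    /* Discard a relation's visibility map; the next VACUUM rebuilds it. */
    Datum
    pg_truncate_visibility_map(PG_FUNCTION_ARGS)
    {
        Oid         relid = PG_GETARG_OID(0);
        Relation    rel = relation_open(relid, AccessExclusiveLock);

        visibilitymap_truncate(rel, 0);     /* keep zero heap blocks' worth */

        relation_close(rel, AccessExclusiveLock);
        PG_RETURN_VOID();
    }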
On 2016-06-06 05:34:32 -0400, Robert Haas wrote: > On Mon, Jun 6, 2016 at 5:11 AM, Michael Paquier > <michael.paquier@gmail.com> wrote: > >> Attached is a sample patch that controls whole-table vacuum via a new GUC parameter. > > > > Don't we want a reloption for that? Just wondering... > > Why? Just for consistency? I think the bigger question here is > whether we need to do anything at all. It's true that, without some > new option, we'll lose the ability to forcibly vacuum every page in > the relation, even if all-frozen. But there's not much use case for > that in the first place. It will be potentially helpful if it turns > out that we have a bug that sets the all-frozen bit on pages that are > not, in fact, all-frozen. Otherwise, what's the use? Except that we right now don't have any realistic way to figure out whether this new feature actually does the right thing. Which makes testing this *considerably* harder than just VACUUM (dwim). I think it's unacceptable to release this feature without a way that'll tell that it so far has/has not corrupted the database. Would that, in a perfect world, be vacuum? No, probably not. But since we're not in a perfect world... Andres
On Mon, Jun 6, 2016 at 11:28 AM, Andres Freund <andres@anarazel.de> wrote: > On 2016-06-06 05:34:32 -0400, Robert Haas wrote: >> On Mon, Jun 6, 2016 at 5:11 AM, Michael Paquier >> <michael.paquier@gmail.com> wrote: >> >> Attached is a sample patch that controls whole-table vacuum via a new GUC parameter. >> > >> > Don't we want a reloption for that? Just wondering... >> >> Why? Just for consistency? I think the bigger question here is >> whether we need to do anything at all. It's true that, without some >> new option, we'll lose the ability to forcibly vacuum every page in >> the relation, even if all-frozen. But there's not much use case for >> that in the first place. It will be potentially helpful if it turns >> out that we have a bug that sets the all-frozen bit on pages that are >> not, in fact, all-frozen. Otherwise, what's the use? > > Except that we right now don't have any realistic way to figure out > whether this new feature actually does the right thing. Which makes > testing this *considerably* harder than just VACUUM (dwim). I think it's > unacceptable to release this feature without a way that'll tell that it > so far has/has not corrupted the database. Would that, in a perfect > world, be vacuum? No, probably not. But since we're not in a perfect world... I just don't see how running VACUUM on the all-frozen pages is going to help. In terms of diagnostic tools, you can get the VM bits and page-level bits using the pg_visibility extension; I wrote it precisely because of concerns like the ones you raise here. If you want to cross-check the page-level bits against the tuple-level bits, you can do that with the pageinspect extension. And if you do those things, you can actually find out whether stuff is broken. Vacuuming the all-frozen pages won't tell you that. It will either do nothing (which doesn't tell you that things are OK) or it will change something (possibly without reporting any message, and possibly making a bad situation worse instead of better). -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 2016-06-06 11:37:25 -0400, Robert Haas wrote: > On Mon, Jun 6, 2016 at 11:28 AM, Andres Freund <andres@anarazel.de> wrote: > > Except that we right now don't have any realistic way to figure out > > whether this new feature actually does the right thing. Which makes > > testing this *considerably* harder than just VACUUM (dwim). I think it's > > unacceptable to release this feature without a way that'll tell that it > > so far has/has not corrupted the database. Would that, in a perfect > > world, be vacuum? No, probably not. But since we're not in a perfect world... > > I just don't see how running VACUUM on the all-frozen pages is going > to help. Because we can tell people in the beta2 announcement or some wiki page "please run VACUUM (scan_all)" and check whether it emits WARNINGs. And if we suspect the freeze map in bug reports, we can just ask reporters to run a VACUUM (scan_all). > In terms of diagnostic tools, you can get the VM bits and > page-level bits using the pg_visibility extension; I wrote it > precisely because of concerns like the ones you raise here. If you > want to cross-check the page-level bits against the tuple-level bits, > you can do that with the pageinspect extension. And if you do those > things, you can actually find out whether stuff is broken. That's WAY out of reach of any "normal users". Adding a vacuum option is doable, writing complex queries is not. > Vacuuming the all-frozen pages won't tell you that. It will either do > nothing (which doesn't tell you that things are OK) or it will change > something (possibly without reporting any message, and possibly making > a bad situation worse instead of better). We found a number of bugs in the equivalent all-visible handling via the vacuum error reporting around it. Greetings, Andres Freund
Robert Haas <robertmhaas@gmail.com> writes: > On Mon, Jun 6, 2016 at 11:28 AM, Andres Freund <andres@anarazel.de> wrote: >> Except that we right now don't have any realistic way to figure out >> whether this new feature actually does the right thing. > I just don't see how running VACUUM on the all-frozen pages is going > to help. Yes. I don't see that any of the proposed features would be very useful for answering the question "is my VM incorrect". Maybe they would fix problems, and maybe not, but in any case you couldn't rely on VACUUM to tell you about a problem. (Even if you've got warning messages in there, they might disappear into the postmaster log during an auto-vacuum. Warning messages in VACUUM are not a good debugging technology.) regards, tom lane
On Mon, Jun 6, 2016 at 11:44 AM, Andres Freund <andres@anarazel.de> wrote: >> In terms of diagnostic tools, you can get the VM bits and >> page-level bits using the pg_visibility extension; I wrote it >> precisely because of concerns like the ones you raise here. If you >> want to cross-check the page-level bits against the tuple-level bits, >> you can do that with the pageinspect extension. And if you do those >> things, you can actually find out whether stuff is broken. > > That's WAY out of reach of any "normal users". Adding a vacuum option > is doable, writing complex queries is not. Why would they have to write the complex query? Wouldn't they just need to run the one we wrote for them? I mean, I'm not 100% dead set against this option you want, but in all honesty, I would never, ever tell anyone to use it. Unleashing VACUUM on possibly-damaged data is just asking it to decide to prune away tuples you don't want gone. I would try very hard to come up with something to give that user that was only going to *read* the possibly-damaged data with as little chance of modifying or erasing it as possible. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
* Robert Haas (robertmhaas@gmail.com) wrote: > On Mon, Jun 6, 2016 at 11:44 AM, Andres Freund <andres@anarazel.de> wrote: > >> In terms of diagnostic tools, you can get the VM bits and > >> page-level bits using the pg_visibility extension; I wrote it > >> precisely because of concerns like the ones you raise here. If you > >> want to cross-check the page-level bits against the tuple-level bits, > >> you can do that with the pageinspect extension. And if you do those > >> things, you can actually find out whether stuff is broken. > > > > That's WAY out of reach of any "normal users". Adding a vacuum option > > is doable, writing complex queries is not. > > Why would they have to write the complex query? Wouldn't they just > need to run the one we wrote for them? > > I mean, I'm not 100% dead set against this option you want, but in all > honesty, I would never, ever tell anyone to use it. Unleashing > VACUUM on possibly-damaged data is just asking it to decide to prune > away tuples you don't want gone. I would try very hard to come up > with something to give that user that was only going to *read* the > possibly-damaged data with as little chance of modifying or erasing it > as possible. I certainly agree with this. We need a read-only utility which checks that the system is in a correct and valid state. There are a few of those which have been built for different pieces, I believe, and we really should have one for the visibility map, but I don't think it makes sense to imply in any way that VACUUM can or should be used for that. Thanks! Stephen
On 2016-06-06 14:24:14 -0400, Stephen Frost wrote: > * Robert Haas (robertmhaas@gmail.com) wrote: > > On Mon, Jun 6, 2016 at 11:44 AM, Andres Freund <andres@anarazel.de> wrote: > > >> In terms of diagnostic tools, you can get the VM bits and > > >> page-level bits using the pg_visibility extension; I wrote it > > >> precisely because of concerns like the ones you raise here. If you > > >> want to cross-check the page-level bits against the tuple-level bits, > > >> you can do that with the pageinspect extension. And if you do those > > >> things, you can actually find out whether stuff is broken. > > > > > > That's WAY out of reach of any "normal users". Adding a vacuum option > > > is doable, writing complex queries is not. > > > > Why would they have to write the complex query? Wouldn't they just > > need to run the one we wrote for them? Then write that query. Verify that that query performs halfway reasonably fast. Document that it should be run against databases after subjecting them to tests. That'd address my concern as well. > > I mean, I'm not 100% dead set against this option you want, but in all > > honesty, I would never, ever tell anyone to use it. Unleashing > > VACUUM on possibly-damaged data is just asking it to decide to prune > > away tuples you don't want gone. I would try very hard to come up > > with something to give that user that was only going to *read* the > > possibly-damaged data with as little chance of modifying or erasing it > > as possible. I'm more concerned about being able to verify that the freeze logic actually does something meaningful, in situations where we'd *NOT* expect any problems. If we're not trusting vacuum in that situation, well ... > I certainly agree with this. > > We need a read-only utility which checks that the system is in a correct > and valid state. There are a few of those which have been built for > different pieces, I believe, and we really should have one for the > visibility map, but I don't think it makes sense to imply in any way > that VACUUM can or should be used for that. Meh. This is vacuum behaviour that *has existed* up to this point. You essentially removed it. Sure, I'm all for adding a verification tool. But that's just pie in the sky at this point. We have a complex, data-loss-threatening feature, which just about nobody can verify at this point. That's crazy. Greetings, Andres Freund
On Mon, Jun 6, 2016 at 2:35 PM, Andres Freund <andres@anarazel.de> wrote: >> > Why would they have to write the complex query? Wouldn't they just >> > need to run the one we wrote for them? > > Then write that query. Verify that that query performs halfway > reasonably fast. Document that it should be run against databases after > subjecting them to tests. That'd address my concern as well. You know, I am starting to lose a teeny bit of patience here. I do appreciate you reviewing this code, very much, and genuinely, and it would be great if more people wanted to review it. But this kind of reads like you think that I'm being a jerk, which I'm trying pretty hard not to be, and like you have the right to assign me arbitrary work, which I think you don't. If you want to have a reasonable conversation about what the options are for making this better, great. If you want me to do some work to help improve things on a patch I committed, that is 100% fair. But I don't know what I did to earn this response which, to me, reads as rather demanding and rather exasperated. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Hi, On 2016-06-06 15:16:10 -0400, Robert Haas wrote: > On Mon, Jun 6, 2016 at 2:35 PM, Andres Freund <andres@anarazel.de> wrote: > >> > Why would they have to write the complex query? Wouldn't they just > >> > need to run the one we wrote for them? > > > > Then write that query. Verify that that query performs halfway > > reasonably fast. Document that it should be run against databases after > > subjecting them to tests. That'd address my concern as well. > > You know, I am starting to lose a teeny bit of patience here. Same here. > I do appreciate you reviewing this code, very much, and genuinely, and > it would be great if more people wanted to review it. > But this kind of reads like you think that I'm being a jerk, which I'm > trying pretty hard not to be I don't think you're a jerk. But I am losing a good bit of my patience here. I posted these issues a month ago, and for a long while the only thing that happened was bikeshedding about the name of something that wasn't even decided to happen yet (obviously said bikeshedding isn't your fault). > and like you have the right to assign me arbitrary work, which I > think you don't. It's not like adding a parameter for this would be a lot of work; there's even a patch out there. I'm getting impatient because I feel the issue of this critical feature not being testable is getting ignored and/or played down. And then sidetracked into a general "let's add a database consistency checker" type discussion. Which we need, but won't get in 9.6. If you say: "I agree with the feature in principle, but I don't want to spend time to review/commit it." - ok, that's fair enough. But at the moment that isn't what I'm reading between the lines. > If you want to have a > reasonable conversation about what the options are for making this > better, great. Yes, I want that. > If you want me to do some work to help improve things on a patch I > committed, that is 100% fair. But I don't know what I did to earn > this response which, to me, reads as rather demanding and rather > exasperated. I don't think it's absurd to make some demands on the committer of an impact-heavy feature, about at least finding a realistic path towards the new feature being realistically testable. This is a scary (but *REALLY IMPORTANT*) patch, and I don't understand why it's ok that we can't push it through a couple of wraparounds under high concurrency and easily verify that the freeze map is in sync with the actual data. And yes, I *am* exasperated that I'm the only one who appears to be scared by the lack of that capability. I think the feature is in a *lot* better shape than multixacts, but it certainly has the potential to do even more damage in ways that'll essentially be unrecoverable. Andres
Andres, all, * Andres Freund (andres@anarazel.de) wrote: > On 2016-06-06 15:16:10 -0400, Robert Haas wrote: > > On Mon, Jun 6, 2016 at 2:35 PM, Andres Freund <andres@anarazel.de> wrote: > > and like you have the right to assign me arbitrary work, which I > > think you don't. > > It's not like adding a parameter for this would be a lot of work; > there's even a patch out there. I'm getting impatient because I feel > the issue of this critical feature not being testable is getting ignored > and/or played down. And then sidetracked into a general "let's add a > database consistency checker" type discussion. Which we need, but won't > get in 9.6. To be clear, I was pointing out that we've had similar types of consistency checkers implemented for other big features (eg: Heikki's work on checking that WAL works) and that it'd be good to have one here also. That could be as simple as a query with the right things installed, or it might be an independent tool, but not having any way to check isn't good. That said, trying to make VACUUM do that doesn't make sense to me either. Perhaps that's not an option due to the lateness of the hour or the lack of manpower behind it, but that doesn't seem to be what has been said so far. > > If you want me to do some work to help improve things on a patch I > > committed, that is 100% fair. But I don't know what I did to earn > > this response which, to me, reads as rather demanding and rather > > exasperated. > > I don't think it's absurd to make some demands on the committer of an > impact-heavy feature, about at least finding a realistic path towards > the new feature being realistically testable. This is a scary (but > *REALLY IMPORTANT*) patch, and I don't understand why it's ok that we > can't push it through a couple of wraparounds under high concurrency and > easily verify that the freeze map is in sync with the actual data. > > And yes, I *am* exasperated that I'm the only one who appears to be > scared by the lack of that capability. I think the feature is in a > *lot* better shape than multixacts, but it certainly has the potential > to do even more damage in ways that'll essentially be unrecoverable. Not having a straightforward way to ensure that it's working properly is certainly concerning to me as well. Thanks! Stephen
On 2016-06-06 16:18:19 -0400, Stephen Frost wrote: > That could be as simple as a query with the right things installed, or > it might be an independent tool, but not having any way to check isn't > good. That said, trying to make VACUUM do that doesn't make sense to me > either. The point is that VACUUM *has* these types of checks, and has had them for many years:

    else if (all_visible_according_to_vm && !PageIsAllVisible(page) &&
             VM_ALL_VISIBLE(onerel, blkno, &vmbuffer))
    {
        elog(WARNING, "page is not marked all-visible but visibility map bit is set in relation \"%s\" page %u",
             relname, blkno);
        visibilitymap_clear(onerel, blkno, vmbuffer);
    }
    ...
    else if (PageIsAllVisible(page) && has_dead_tuples)
    {
        elog(WARNING, "page containing dead tuples is marked as all-visible in relation \"%s\" page %u",
             relname, blkno);
        PageClearAllVisible(page);
        MarkBufferDirty(buf);
        visibilitymap_clear(onerel, blkno, vmbuffer);
    }

The point is that, after the introduction of the freeze bit, there's no way to reach them anymore (and they're missing a useful extension of these warnings, but ...); these warnings have caught bugs. I wouldn't advocate for the vacuum option otherwise. Greetings, Andres Freund
On Mon, Jun 6, 2016 at 4:06 PM, Andres Freund <andres@anarazel.de> wrote: >> I do appreciate you reviewing this code, very much, and genuinely, and >> it would be great if more people wanted to review it. > >> But this kind of reads like you think that I'm being a jerk, which I'm >> trying pretty hard not to be > > I don't think you're a jerk. But I am losing a good bit of my patience > here. I posted these issues a month ago, and for a long while the > only thing that happened was bikeshedding about the name of something > that wasn't even decided to happen yet (obviously said bikeshedding > isn't your fault). No, the bikeshedding is not my fault. As for the timing, you posted your first comments exactly a week before beta1, when I was still busy addressing issues that were reported before you reported yours, and I did not think it was realistic to get them addressed in the time available. If you'd sent them two weeks sooner, I would probably have done so. Now, it's been four weeks since beta1 wrapped, one of which was PGCon. As far as I understand at this point in time, your review identified exactly zero potential data loss bugs. (We thought there was one, but it looks like there isn't.) All of the non-critical defects you identified have now been fixed, apart from the lack of a better testing tool. And since there is ongoing discussion (call it bikeshedding if you want) about what would actually help in that area, I really don't feel like anything very awful is happening here. I really don't understand how you can not weigh in on the original thread leading up to my mid-March commits and say "hey, this needs a better testing tool", and then when you finally get around to reviewing it in May, I'm supposed to drop everything and write one immediately. Why do you get two months from the time of commit to weigh in but I get no time to respond? For my part, I thought I *had* written a testing tool - that's what pg_visibility is and that's what I used to test the feature before committing it. Now, you think that's not good enough, and I respect your opinion, but it's not as if you said this back when this was being committed. Or at least if you did, I don't remember it. >> and like you have the right to assign me arbitrary work, which I >> think you don't. > > It's not like adding a parameter for this would be a lot of work; > there's even a patch out there. I'm getting impatient because I feel > the issue of this critical feature not being testable is getting ignored > and/or played down. And then sidetracked into a general "let's add a > database consistency checker" type discussion. Which we need, but won't > get in 9.6. I know there's a patch. Both Tom and I are skeptical about whether it adds value, and I really don't think you've spelled out in as much detail why you think it will help as I have why I think it won't. Initially, I was like "ok, sure, we should have that", but the more I thought about it (another advantage of time passing: you can think about things more) the less convinced I was that it did anything useful. I don't think that's very unreasonable. The importance of the feature is exactly why we *should* think carefully about what is best here and not just do the first thing that pops into our head. > If you say: "I agree with the feature in principle, but I don't want to > spend time to review/commit it." - ok, that's fair enough. But at the > moment that isn't what I'm reading between the lines.
No, what I'm saying is "I'm not confident that this feature adds value, and I'm afraid that by adding it we are making ourselves feel better without solving any real problem". I'm also saying "let's try to agree on what problems we need to solve first and then decide on the solutions". >> If you want to have a >> reasonable conversation about what the options are for making this >> better, great. > > Yes, I want that. Great. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Mon, Jun 6, 2016 at 2:35 PM, Andres Freund <andres@anarazel.de> wrote: >> > Why would they have to write the complex query? Wouldn't they just >> > need to run the one we wrote for them? > > Then write that query. Verify that that query performs halfway > reasonably fast. Document that it should be run against databases after > subjecting them to tests. That'd address my concern as well. Here is a first attempt at such a query. It requires that the pageinspect and pg_visibility extensions be installed.

    SELECT c.oid, v.blkno, array_agg(hpi.lp) AS affect_lps
    FROM pg_class c,
         LATERAL ROWS FROM (pg_visibility(c.oid)) v,
         LATERAL ROWS FROM (heap_page_items(get_raw_page(c.oid::regclass::text,
                                                         blkno::int4))) hpi
    WHERE c.relkind IN ('r', 't', 'm')
      AND v.all_frozen
      AND (((hpi.t_infomask & 768) != 768 AND hpi.t_xmin NOT IN (1, 2))
           OR (hpi.t_infomask & 2048) != 2048)
    GROUP BY 1, 2
    ORDER BY 1, 2;

I am not sure this is 100% correct, especially the XMAX-checking part: is HEAP_XMAX_INVALID guaranteed to be set on a fully-frozen tuple? Is the method of constructing the first argument to get_raw_page() going to be robust in all cases? I'm not sure what the performance will be on a large table, either. That will have to be checked. And I obviously have not done extensive stress runs yet. But maybe it's a start. Comments? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Mon, Jun 6, 2016 at 11:35 AM, Andres Freund <andres@anarazel.de> wrote: >> We need a read-only utility which checks that the system is in a correct >> and valid state. There are a few of those which have been built for >> different pieces, I believe, and we really should have one for the >> visibility map, but I don't think it makes sense to imply in any way >> that VACUUM can or should be used for that. > > Meh. This is vacuum behaviour that *has existed* up to this point. You > essentially removed it. Sure, I'm all for adding a verification > tool. But that's just pie in the sky at this point. We have a complex, > data-loss-threatening feature, which just about nobody can verify at > this point. That's crazy. FWIW, I agree with the general sentiment. Building a stress-testing suite would have been a good idea. In general, testability is a design goal that I'd be willing to give up other things for. -- Peter Geoghegan
On Mon, Jun 6, 2016 at 4:27 PM, Andres Freund <andres@anarazel.de> wrote: > On 2016-06-06 16:18:19 -0400, Stephen Frost wrote: >> That could be as simple as a query with the right things installed, or >> it might be an independent tool, but not having any way to check isn't >> good. That said, trying to make VACUUM do that doesn't make sense to me >> either. > > The point is that VACUUM *has* these types of checks, and has had them for > many years:
>
>     else if (all_visible_according_to_vm && !PageIsAllVisible(page) &&
>              VM_ALL_VISIBLE(onerel, blkno, &vmbuffer))
>     {
>         elog(WARNING, "page is not marked all-visible but visibility map bit is set in relation \"%s\" page %u",
>              relname, blkno);
>         visibilitymap_clear(onerel, blkno, vmbuffer);
>     }
>     ...
>     else if (PageIsAllVisible(page) && has_dead_tuples)
>     {
>         elog(WARNING, "page containing dead tuples is marked as all-visible in relation \"%s\" page %u",
>              relname, blkno);
>         PageClearAllVisible(page);
>         MarkBufferDirty(buf);
>         visibilitymap_clear(onerel, blkno, vmbuffer);
>     }
>
> The point is that, after the introduction of the freeze bit, there's no > way to reach them anymore (and they're missing a useful extension of > these warnings, but ...); these warnings have caught bugs. I wouldn't > advocate for the vacuum option otherwise. So a couple of things:

1. I think it is pretty misleading to say that those checks aren't reachable any more. It's not like we freeze every page when we mark it all-visible. In most cases, I think that what will happen is that the page will be marked all-visible and then, because it is all-visible, skipped by subsequent vacuums, so that it doesn't get marked all-frozen until a few hundred million transactions later. Of course there will be some cases when a page gets marked all-visible and all-frozen at the same time, but I don't see why we should expect that to be the norm.

2. With the new pg_visibility extension, you can actually check the same thing the first warning checks, like this:

    select * from pg_visibility('t1'::regclass) where all_visible and not pd_all_visible;

IMHO, that's a substantial improvement over running VACUUM and checking whether it spits out a WARNING. The second one, you can't currently trigger for all-frozen pages. The query I just sent in my other email could perhaps be adapted to that purpose, but maybe this is a good-enough reason to add VACUUM (even_frozen_pages).

3. If you think there are analogous checks that I should add for the frozen case, or that you want to add yourself, please say what they are specifically. I *did* think about it when I wrote that code and I didn't see how to make it work. If I had, I would have added them. The whole point of review here is, hopefully, to illuminate what should have been done differently - if I'd known how to do it better, I would have done so. Provide an idea, or better yet, provide a patch. If you see how to do it, coding it up shouldn't be the hard part. Thanks, -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 2016-06-06 16:41:19 -0400, Robert Haas wrote: > I really don't understand how you can not weigh in on the original > thread leading up to my mid-March commits and say "hey, this needs a > better testing tool", and then when you finally get around to > reviewing it in May, I'm supposed to drop everything and write one > immediately. Meh. Asking you to "drop everything" and starting to push a month later are very different things. The reason I'm pushing is because this atm seems likely to slip enough that we'll decide "can't do this for 9.6". And I think that'd be seriously bad. > Why do you get two months from the time of commit to weigh in but I > get no time to respond? Really? You've started to apply pressure to fix things days after they've been discovered. It's been a month. > For my part, I thought I *had* > written a testing tool - that's what pg_visibility is and that's what > I used to test the feature before committing it. I think looking only at page-level data, and not at row-level data, is insufficient. And I think we need to make $tool output the data in a way that only returns data if things are wrong (that can be a pre-canned query). > Now, you think that's not good enough, and I respect your opinion, but > it's not as if you said this back when this was being committed. Or > at least if you did, I don't remember it. I think I mentioned testing ages ago, but not around the commit, no. I kind of had assumed that it was there. I don't think that's really relevant though. Backend flushing was discussed and benchmarked over months as well; and while I don't agree with your conclusion, it's absolutely sane of you to push for changing the default on that, even if you didn't immediately push back. > I know there's a patch. Both Tom and I are skeptical about whether it > adds value, and I really don't think you've spelled out in as much > detail why you think it will help as I have why I think it won't. The primary reason I think it'll help is that it allows users/testers to run a simple one-line command (VACUUM (scan_all);) in their database, and they'll get a clear "WARNING: XXX is bad" message if something's broken, and nothing if things are ok. Vacuum isn't a bad place for that, because it'll be the place that removes dead item pointers and such if things were wrongly labeled; and because we historically have emitted warnings from there. The more complex stuff we ask testers to run, the less likely it is that they'll actually do that. I'd also be ok with adding & documenting (beta release notes)

    CREATE EXTENSION pg_visibility;
    SELECT relname FROM pg_class WHERE relkind IN ('r', 'm') AND NOT pg_check_visibility(oid);

or something along those lines. Greetings, Andres Freund
On Mon, Jun 6, 2016 at 5:06 PM, Andres Freund <andres@anarazel.de> wrote: > On 2016-06-06 16:41:19 -0400, Robert Haas wrote: >> I really don't understand how you can not weigh in on the original >> thread leading up to my mid-March commits and say "hey, this needs a >> better testing tool", and then when you finally get around to >> reviewing it in May, I'm supposed to drop everything and write one >> immediately. > > Meh. Asking you to "drop everything" and starting to push a month later > are very different things. The reason I'm pushing is because this atm > seems likely to slip enough that we'll decide "can't do this for > 9.6". And I think that'd be seriously bad. To be clear, I'm not objecting to you pushing on this. I just think your tone sounds a bit, uh, antagonized. >> Why do you get two months from the time of commit to weigh in but I >> get no time to respond? > > Really? You've started to apply pressure to fix things days after > they've been discovered. It's been a month. Yes, it would have been nice if I had gotten to this one sooner. But it's not like you said "hey, hurry up" before I started working on it. You waited until I did start working on it and *then* complained that I didn't get to it sooner. I cannot rewind time. >> For my part, I thought I *had* >> written a testing tool - that's what pg_visibility is and that's what >> I used to test the feature before committing it. > > I think looking only at page-level data, and not at row-level data, is > insufficient. And I think we need to make $tool output the data in a way > that only returns data if things are wrong (that can be a pre-canned > query). OK. I didn't think that was necessary, but it sure can't hurt. >> I know there's a patch. Both Tom and I are skeptical about whether it >> adds value, and I really don't think you've spelled out in as much >> detail why you think it will help as I have why I think it won't. > > The primary reason I think it'll help is that it allows users/testers to > run a simple one-line command (VACUUM (scan_all);) in their database, and > they'll get a clear "WARNING: XXX is bad" message if something's broken, > and nothing if things are ok. Vacuum isn't a bad place for that, > because it'll be the place that removes dead item pointers and such if > things were wrongly labeled; and because we historically have emitted > warnings from there. The more complex stuff we ask testers to run, the > less likely it is that they'll actually do that. OK, now I understand. Let's see if there is general agreement on this and then we can decide how to proceed. I think the main danger here is that people will think that this option is more useful than it really is and start using it in all kinds of cases where it isn't really necessary in the hopes that it will fix problems it really can't fix. I think we need to write the documentation in such a way as to be deeply discouraging to people who might otherwise be prone to unwarranted optimism. Otherwise, 5 years from now, we're going to be fielding complaints from people who are unhappy that there's no way to make autovacuum run with (even_frozen_pages true). > I'd also be ok with adding & documenting (beta release notes) > CREATE EXTENSION pg_visibility; > SELECT relname FROM pg_class WHERE relkind IN ('r', 'm') AND NOT pg_check_visibility(oid); > or something along those lines. That wouldn't be too useful as-written in my book, because it gives you no detail on what exactly the problem was.
Maybe it could be "pg_check_visibility(regclass) RETURNS SETOF tid", where the returned TIDs are non-frozen TIDs on frozen pages. Then I think something like this would work:

    SELECT c.oid, pg_check_visibility(c.oid)
    FROM pg_class c
    WHERE relkind IN ('r', 't', 'm');

If you get any rows back, you've got trouble. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Hi, On 2016-06-06 17:00:19 -0400, Robert Haas wrote: > 1. I think it is pretty misleading to say that those checks aren't > reachable any more. It's not like we freeze every page when we mark > it all-visible. True. What I mean is that you can't force the checks (and some that I think should be added) to occur anymore. Once a page is frozen it'll be kinda hard to predict whether vacuum touches it (due to the skip logic). > 2. With the new pg_visibility extension, you can actually check the > same thing the first warning checks, like this: > > select * from pg_visibility('t1'::regclass) where all_visible and not > pd_all_visible; Right, but not the second. > IMHO, that's a substantial improvement over running VACUUM and > checking whether it spits out a WARNING. I think it's a mixed bag. I do think that WARNINGs are a lot easier to understand for a casual user/tester than having to write/copy queries which return results where you don't know what the expected result is. I agree that it's better to have that in a non-modifying way - although I'm afraid atm it's not really possible to do a HeapTupleSatisfies* without modifications :(. > 3. If you think there are analogous checks that I should add for the > frozen case, or that you want to add yourself, please say what they > are specifically. I *did* think about it when I wrote that code and I > didn't see how to make it work. If I had, I would have added them. > The whole point of review here is, hopefully, to illuminate what > should have been done differently - if I'd known how to do it better, > I would have done so. Provide an idea, or better yet, provide a > patch. If you see how to do it, coding it up shouldn't be the hard > part. I think it's pretty important (and not hard) to add a check for (all_frozen_according_to_vm && has_unfrozen_tuples). Checking for VM_ALL_FROZEN && !VM_ALL_VISIBLE looks worthwhile as well, especially as we could check that always, without measurable overhead. But the former primarily makes sense if we have a way to force the check to occur in a way that's not dependent on the state of neighbouring pages. Greetings, Andres Freund
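Written out as a standalone sketch over plain booleans (the real checks would live in vacuumlazy.c and use its local state; the variable names are assumptions), the two suggested tests might look like this:

    #include <stdbool.h>
    #include <stdio.h>

    /*
     * Sketch of the suggested sanity checks: an all-frozen page must
     * also be all-visible, and must not contain any unfrozen tuples.
     */
    static void
    check_frozen_consistency(unsigned blkno,
                             bool vm_all_visible,
                             bool vm_all_frozen,
                             bool has_unfrozen_tuples)
    {
        if (vm_all_frozen && !vm_all_visible)
            fprintf(stderr,
                    "WARNING: page %u is marked all-frozen but not all-visible\n",
                    blkno);

        if (vm_all_frozen && has_unfrozen_tuples)
            fprintf(stderr,
                    "WARNING: page %u contains unfrozen tuples but is marked all-frozen\n",
                    blkno);
    }

The first test only compares two VM bits, which is why it could run unconditionally at negligible cost; the second requires actually examining the page's tuples, which is why it only pays off if the page can be forced to be visited.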
Hi, On 2016-06-06 17:22:38 -0400, Robert Haas wrote: > > I'd also be ok with adding & documenting (beta release notes) > > CREATE EXTENSION pg_visibility; > > SELECT relname FROM pg_class WHERE relkind IN ('r', 'm') AND NOT pg_check_visibility(oid); > > or something along those lines. > > That wouldn't be too useful as-written in my book, because it gives > you no detail on what exactly the problem was. True. I don't think that's a big issue though, because we'd likely want a lot more detail after a report anyway, to analyze things properly. > Maybe it could be > "pg_check_visibility(regclass) RETURNS SETOF tid", where the returned > TIDs are non-frozen TIDs on frozen pages. Then I think something like > this would work: > > SELECT c.oid, pg_check_visibility(c.oid) FROM pg_class c WHERE relkind > IN ('r', 't', 'm'); > > If you get any rows back, you've got trouble. That'd work too, with the slight danger of returning way too much data. - Andres
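For concreteness, the heart of such a pg_check_visibility()-style scan might look roughly like the sketch below (assumed 9.6 heapam/visibilitymap APIs and a caller-supplied report callback; the SRF plumbing, snapshot handling, and any checks beyond heap_tuple_needs_eventual_freeze() are elided, and this is not the patch that follows):

    #include "postgres.h"

    #include "access/heapam.h"
    #include "access/visibilitymap.h"
    #include "storage/bufmgr.h"

    /*
     * If the VM claims blkno is all-frozen, report the offset of every
     * tuple on it that would still need freezing eventually.
     */
    static void
    check_frozen_block(Relation rel, BlockNumber blkno, Buffer *vmbuffer,
                       void (*report) (BlockNumber, OffsetNumber))
    {
        Buffer      buf;
        Page        page;
        OffsetNumber offnum,
                    maxoff;

        if (!VM_ALL_FROZEN(rel, blkno, vmbuffer))
            return;             /* only all-frozen pages are checked */

        buf = ReadBuffer(rel, blkno);
        LockBuffer(buf, BUFFER_LOCK_SHARE);
        page = BufferGetPage(buf);
        maxoff = PageGetMaxOffsetNumber(page);

        for (offnum = FirstOffsetNumber; offnum <= maxoff;
             offnum = OffsetNumberNext(offnum))
        {
            ItemId      itemid = PageGetItemId(page, offnum);

            if (!ItemIdIsNormal(itemid))
                continue;
            if (heap_tuple_needs_eventual_freeze(
                    (HeapTupleHeader) PageGetItem(page, itemid)))
                report(blkno, offnum);  /* unfrozen tuple on a "frozen" page */
        }

        UnlockReleaseBuffer(buf);
    }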
On Tue, Jun 7, 2016 at 2:52 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Mon, Jun 6, 2016 at 5:06 PM, Andres Freund <andres@anarazel.de> wrote:
>
>
> > I'd also be ok with adding & documenting (beta release notes)
> > CREATE EXTENSION pg_visibility;
> > SELECT relname FROM pg_class WHERE relkind IN ('r', 'm') AND NOT pg_check_visibility(oid);
> > or something along those lines.
>
> That wouldn't be too useful as-written in my book, because it gives
> you no detail on what exactly the problem was. Maybe it could be
> "pg_check_visibility(regclass) RETURNS SETOF tid", where the returned
> TIDs are non-frozen TIDs on frozen pages. Then I think something like
> this would work:
>
> SELECT c.oid, pg_check_visibility(c.oid) FROM pg_class c WHERE relkind
> IN ('r', 't', 'm');
>
I have implemented the above function in the attached patch. Currently it returns a SETOF tuple IDs, but if we want some variant of that, it should also be possible.
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
Attachment
On Tue, Jun 7, 2016 at 11:19 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Tue, Jun 7, 2016 at 2:52 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>>
>> On Mon, Jun 6, 2016 at 5:06 PM, Andres Freund <andres@anarazel.de> wrote:
>>
>> > I'd also be ok with adding & documenting (beta release notes)
>> > CREATE EXTENSION pg_visibility;
>> > SELECT relname FROM pg_class WHERE relkind IN ('r', 'm') AND NOT
>> > pg_check_visibility(oid);
>> > or something along those lines.
>>
>> That wouldn't be too useful as-written in my book, because it gives
>> you no detail on what exactly the problem was. Maybe it could be
>> "pg_check_visibility(regclass) RETURNS SETOF tid", where the returned
>> TIDs are non-frozen TIDs on frozen pages. Then I think something like
>> this would work:
>>
>> SELECT c.oid, pg_check_visibility(c.oid) FROM pg_class WHERE relkind
>> IN ('r', 't', 'm');
>
> I have implemented the above function in the attached patch. Currently, it
> returns SETOF tupleids, but if we want some variant of the same, that
> should also be possible.
>
> With Regards,
> Amit Kapila.
> EnterpriseDB: http://www.enterprisedb.com

Thank you for implementing the patch.

I've not tested it deeply, but here are some comments.

This check tool only checks whether an all-frozen page has a live
unfrozen tuple. That is, it doesn't cover the case where an all-frozen
page mistakenly has a dead frozen tuple. I think this tool should check
that case too; otherwise the function name would need to be changed.

+ /* Clean up. */
+ if (vmbuffer != InvalidBuffer)
+ ReleaseBuffer(vmbuffer);

I think that we should use BufferIsValid() here.

Regards,

--
Masahiko Sawada
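With Sawada's suggestion applied, the quoted cleanup would read as below; BufferIsValid() is the stock bufmgr.h test, and the behavior is unchanged since InvalidBuffer is the only invalid value vmbuffer can hold here:

	/* Clean up. */
	if (BufferIsValid(vmbuffer))
		ReleaseBuffer(vmbuffer);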
On 2016-06-07 19:49:59 +0530, Amit Kapila wrote: > On Tue, Jun 7, 2016 at 2:52 AM, Robert Haas <robertmhaas@gmail.com> wrote: > > > > On Mon, Jun 6, 2016 at 5:06 PM, Andres Freund <andres@anarazel.de> wrote: > > > > > > > I'd also be ok with adding & documenting (beta release notes) > > > CREATE EXTENSION pg_visibility; > > > SELECT relname FROM pg_class WHERE relkind IN ('r', 'm') AND NOT > pg_check_visibility(oid); > > > or something olong those lines. > > > > That wouldn't be too useful as-written in my book, because it gives > > you no detail on what exactly the problem was. Maybe it could be > > "pg_check_visibility(regclass) RETURNS SETOF tid", where the returned > > TIDs are non-frozen TIDs on frozen pages. Then I think something like > > this would work: > > > > SELECT c.oid, pg_check_visibility(c.oid) FROM pg_class WHERE relkind > > IN ('r', 't', 'm'); > > > > I have implemented the above function in attached patch. Currently, it > returns SETOF tupleids, but if we want some variant of same, that should > also be possible. Cool! I think if we go with the pg_check_visibility approach, we should also copy the other consistency checks from vacuumlazy.c, given they can't easily be triggered. Wonder how we can report both block and tuple level issues. Kinda inclined to report everything as a block level issue? Regards, Andres
On 6/6/16 3:57 PM, Peter Geoghegan wrote: > On Mon, Jun 6, 2016 at 11:35 AM, Andres Freund <andres@anarazel.de> wrote: >>> We need a read-only utility which checks that the system is in a correct >>> and valid state. There are a few of those which have been built for >>> different pieces, I believe, and we really should have one for the >>> visibility map, but I don't think it makes sense to imply in any way >>> that VACUUM can or should be used for that. >> >> Meh. This is vacuum behaviour that *has existed* up to this point. You >> essentially removed it. Sure, I'm all for adding a verification >> tool. But that's just pie in the skie at this point. We have a complex, >> data loss threatening feature, which just about nobody can verify at >> this point. That's crazy. > > FWIW, I agree with the general sentiment. Building a stress-testing > suite would have been a good idea. In general, testability is a design > goal that I'd be willing to give up other things for. Related to that, I suspect it would be helpful if it was possible to test boundary cases in this kind of critical code by separating the logic from the underlying implementation. It becomes very hard to verify the system does the right thing in some of these scenarios, because it's so difficult to put the system into that state to begin with. Stuff that depends on burning through a large number of XIDs is an example of that. (To be clear, I'm talking about unit-test kind of stuff here, not validating an existing system.) -- Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX Experts in Analytics, Data Architecture and PostgreSQL Data in Trouble? Get it in Treble! http://BlueTreble.com 855-TREBLE2 (855-873-2532) mobile: 512-569-9461
On Tue, Jun 7, 2016 at 10:19 AM, Amit Kapila <amit.kapila16@gmail.com> wrote: > I have implemented the above function in attached patch. Currently, it > returns SETOF tupleids, but if we want some variant of same, that should > also be possible. I think we'd want to bump the pg_visibility version to 1.1 and do the upgrade dance, since the existing thing was in beta1. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Tue, Jun 7, 2016 at 10:10 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> Thank you for implementing the patch.
>
> I've not test it deeply but here are some comments.
> This check tool only checks if the frozen page has live-unfrozen tuple.
> That is, it doesn't care in case where the all-frozen page mistakenly
> has dead-frozen tuple.
Do you mean to say that we should have a check for ItemIdIsDead(), and if an item is found to be dead, then add it to the array of non-frozen items? If so: earlier I thought we might not need this check, since we are already using heap_tuple_needs_eventual_freeze(), but looking at it again, it seems wise to check for dead items separately, as those won't be covered by the other check.
>
> + /* Clean up. */
> + if (vmbuffer != InvalidBuffer)
> + ReleaseBuffer(vmbuffer);
>
> I think that we should use BufferIsValid() here.
>
We can use BufferIsValid() as well, but I am trying to be consistent with nearby code; see collect_visibility_data(). We can change it in all places together if people prefer it that way.
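The separate dead-item check being discussed might look something like this; a sketch only, assuming it runs inside the patch's existing per-page loop, with page, blkno, items, and record_corrupt_item() taken from that context:

	OffsetNumber offnum;

	/*
	 * Sketch: on a page the VM claims is all-frozen, no line pointer should
	 * be dead.  heap_tuple_needs_eventual_freeze() never sees dead items,
	 * so they need this separate pass.
	 */
	for (offnum = FirstOffsetNumber;
		 offnum <= PageGetMaxOffsetNumber(page);
		 offnum = OffsetNumberNext(offnum))
	{
		ItemId		itemid = PageGetItemId(page, offnum);

		if (ItemIdIsDead(itemid))
		{
			ItemPointerData tid;

			ItemPointerSet(&tid, blkno, offnum);
			record_corrupt_item(items, &tid);
		}
	}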
On Wed, Jun 8, 2016 at 8:37 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Tue, Jun 7, 2016 at 10:19 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> > I have implemented the above function in attached patch. Currently, it
> > returns SETOF tupleids, but if we want some variant of same, that should
> > also be possible.
>
> I think we'd want to bump the pg_visibility version to 1.1 and do the
> upgrade dance, since the existing thing was in beta1.
>
Okay, will do it in next version of patch.
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On Tue, Jun 7, 2016 at 11:01 PM, Andres Freund <andres@anarazel.de> wrote:
> I think if we go with the pg_check_visibility approach, we should also
> copy the other consistency checks from vacuumlazy.c, given they can't
> easily be triggered.
Are you referring to the checks that are done in lazy_scan_heap() for each block? I think the meaningful checks in this context could be: (a) the page is marked all-visible, but the corresponding vm bit is not set; (b) the page is marked all-visible but has dead tuples; (c) the vm bit indicates frozen, but the page contains non-frozen tuples.
I think right now the design of pg_visibility is such that it returns the required information at the page level to the user by means of various functions like pg_visibility, pg_visibility_map, etc. If we want to add page-level checks in this new routine as well, then we have to think about what the output should be when such checks fail: shall we issue a warning, or return the information in some other way? Also, I think there will be some duplication with the information already provided via other functions of this module.
>
> Wonder how we can report both block and tuple
> level issues. Kinda inclined to report everything as a block level
> issue?
>
Given the way this module currently provides information, it seems better to have separate APIs for block-level and tuple-level inconsistencies. For the block level, I think most of the information can be retrieved by existing APIs, and for the tuple level, this new API can be used.
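In code terms, the three block-level checks listed above could be sketched as follows; report_block() is a hypothetical reporting helper, and has_dead_tuples/all_frozen stand for the results of a per-tuple scan like the one in lazy_scan_heap:

	/* (a) page header says all-visible, but the VM bit is not set */
	if (PageIsAllVisible(page) && !VM_ALL_VISIBLE(rel, blkno, &vmbuffer))
		report_block(blkno, "PD_ALL_VISIBLE set but VM all-visible bit clear");

	/* (b) VM says all-visible, yet the page contains dead tuples */
	if (VM_ALL_VISIBLE(rel, blkno, &vmbuffer) && has_dead_tuples)
		report_block(blkno, "all-visible page contains dead tuples");

	/* (c) VM says all-frozen, but some tuple still needs freezing */
	if (VM_ALL_FROZEN(rel, blkno, &vmbuffer) && !all_frozen)
		report_block(blkno, "all-frozen page contains non-frozen tuples");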
On 2016-06-08 10:04:56 +0530, Amit Kapila wrote: > On Tue, Jun 7, 2016 at 11:01 PM, Andres Freund <andres@anarazel.de> wrote:> > > I think if we go with the pg_check_visibility approach, we should also > > copy the other consistency checks from vacuumlazy.c, given they can't > > easily be triggered. > > Are you referring to checks that are done in lazy_scan_heap() for each > block? Yes. > I think the meaning full checks in this context could be (a) page > is marked as visible, but corresponding vm is not marked. (b) page is > marked as all visible and has dead tuples. (c) vm bit indicates frozen, but > page contains non-frozen tuples. Yes. > I think right now the design of pg_visibility is such that it returns the > required information at page level to user by means of various functions > like pg_visibility, pg_visibility_map, etc. If we want to add page level > checks in this new routine as well, then we have to think what should be > the output if such checks fails, shall we issue warning, shall we return > information in some other way. Right. > Also, I think there will be some duplicity > with the already provided information via other functions of this module. Don't think that's a problem. One part of the functionality then is returning the available information, the other is checking for problems and only returning problematic blocks. > > Wonder how we can report both block and tuple > > level issues. Kinda inclined to report everything as a block level > > issue? > > > > The way currently this module provides information, it seems better to have > separate API's for block and tuple level inconsistency. For block level, I > think most of the information can be retrieved by existing API's and for > tuple level, this new API can be used. I personally think simplicity is more important than detail here; but it's not that important. If this reports a problem, you can look into the nitty gritty using existing functions. Andres
On Wed, Jun 8, 2016 at 12:19 PM, Amit Kapila <amit.kapila16@gmail.com> wrote: > On Tue, Jun 7, 2016 at 10:10 PM, Masahiko Sawada <sawada.mshk@gmail.com> > wrote: >> >> Thank you for implementing the patch. >> >> I've not test it deeply but here are some comments. >> This check tool only checks if the frozen page has live-unfrozen tuple. >> That is, it doesn't care in case where the all-frozen page mistakenly >> has dead-frozen tuple. >> > > Do you mean to say that we should have a check for ItemIdIsDead() and then > if item is found to be dead, then add it to array of non_frozen items? Yes. > If so, earlier I thought we might not need this check as we are already using > heap_tuple_needs_eventual_freeze(), You're right. Sorry, I had misunderstood. > but now again looking at it, it seems > wise to check for dead items separately as those won't be covered by other > check. Sounds good. >> >> + /* Clean up. */ >> + if (vmbuffer != InvalidBuffer) >> + ReleaseBuffer(vmbuffer); >> >> I think that we should use BufferIsValid() here. >> > > We can use BufferIsValid() as well, but I am trying to be consistent with > nearby code, refer collect_visibility_data(). We can change at all places > together if people prefer that way. > In vacuumlazy.c we use it like BufferisValid(vmbuffer), so I think we can replace all these thing to be more safety if there is not specific reason. Regards, -- Masahiko Sawada
On Wed, Jun 8, 2016 at 11:39 AM, Andres Freund <andres@anarazel.de> wrote:
>
> On 2016-06-08 10:04:56 +0530, Amit Kapila wrote:
> > On Tue, Jun 7, 2016 at 11:01 PM, Andres Freund <andres@anarazel.de> wrote:>
> > > I think if we go with the pg_check_visibility approach, we should also
> > > copy the other consistency checks from vacuumlazy.c, given they can't
> > > easily be triggered.
> >
> > Are you referring to checks that are done in lazy_scan_heap() for each
> > block?
>
> Yes.
>
>
> > I think the meaning full checks in this context could be (a) page
> > is marked as visible, but corresponding vm is not marked. (b) page is
> > marked as all visible and has dead tuples. (c) vm bit indicates frozen, but
> > page contains non-frozen tuples.
>
> Yes.
>
If we want to address both page-level and tuple-level inconsistencies, I can see the following possibility.
1. An API that returns a set of records identifying a block that has an inconsistent vm bit, a block where an all-visible page contains dead tuples, and a block where the vm bit indicates frozen but the page contains non-frozen tuples. Three separate block numbers are required in the record to distinguish which problem applies to which block.
Signature of API will be something like:
pg_check_visibility_blocks(regclass, corrupt_vm_blkno OUT bigint, corrupt_dead_blkno OUT bigint, corrupt_frozen_blkno OUT boolean) RETURNS SETOF record
2. An API that provides information about non-frozen tuples on a frozen page
Signature of API:
CREATE FUNCTION pg_check_visibility_tuples(regclass, t_ctid OUT tid) RETURNS SETOF tid
This is the same as what is present in the current patch [1].
With this scheme, a user can use the first API to find corrupt blocks, if any, and the second API when further information is required.
Does that address your concern? If you, Robert, and others are okay with the above idea, then I will send an updated patch.
On Wed, Jun 8, 2016 at 4:01 AM, Amit Kapila <amit.kapila16@gmail.com> wrote: > If we want to address both page level and tuple level inconsistencies, I > could see below possibility. > > 1. An API that returns setof records containing a block that have > inconsistent vm bit, a block where visible page contains dead tuples and a > block where vm bit indicates frozen, but page contains non-frozen tuples. > Three separate block numbers are required in record to distinguish the > problem with block. > > Signature of API will be something like: > pg_check_visibility_blocks(regclass, corrupt_vm_blkno OUT bigint, > corrupt_dead_blkno OUT bigint, corrupt_frozen_blkno OUT boolean) RETURNS > SETOF record I don't understand this, and I think we're making this too complicated. The function that just returned non-frozen TIDs on supposedly-frozen pages was simple. Now we're trying to redesign this into a general-purpose integrity checker on the eve of beta2, and I think that's a bad idea. We don't have time to figure that out, get consensus on it, and do it well, and I don't want to be stuck supporting something half-baked from now until eternity. Let's scale back our goals here to something that can realistically be done well in the time available. Here's my proposal: 1. You already implemented a function to find non-frozen tuples on supposedly all-frozen pages. Great. 2. Let's implement a second function to find dead tuples on supposedly all-visible pages. 3. And then let's call it good. If we start getting into the game of "well, that's not enough because you can also check for X", that's an infinite treadmill. There will always be more things we can check. But that's the project of building an integrity checker, which while worthwhile, is out of scope for 9.6. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Wed, Jun 8, 2016 at 6:31 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Wed, Jun 8, 2016 at 4:01 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> > If we want to address both page level and tuple level inconsistencies, I
> > could see below possibility.
> >
> > 1. An API that returns setof records containing a block that have
> > inconsistent vm bit, a block where visible page contains dead tuples and a
> > block where vm bit indicates frozen, but page contains non-frozen tuples.
> > Three separate block numbers are required in record to distinguish the
> > problem with block.
> >
> > Signature of API will be something like:
> > pg_check_visibility_blocks(regclass, corrupt_vm_blkno OUT bigint,
> > corrupt_dead_blkno OUT bigint, corrupt_frozen_blkno OUT boolean) RETURNS
> > SETOF record
>
> I don't understand this,
This new API was to address Andres's concern about checking block-level inconsistencies, as we do in lazy_scan_heap. It returns a set of inconsistent blocks.
>
> The function that just returned non-frozen TIDs on
> supposedly-frozen pages was simple. Now we're trying to redesign this
> into a general-purpose integrity checker on the eve of beta2, and I
> think that's a bad idea. We don't have time to figure that out, get
> consensus on it, and do it well, and I don't want to be stuck
> supporting something half-baked from now until eternity. Let's scale
> back our goals here to something that can realistically be done well
> in the time available.
>
> Here's my proposal:
>
> 1. You already implemented a function to find non-frozen tuples on
> supposedly all-frozen pages. Great.
>
> 2. Let's implement a second function to find dead tuples on supposedly
> all-visible pages.
>
> 3. And then let's call it good.
>
Your proposal sounds good; I will send an updated patch if there are no further concerns.
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On Wed, Jun 8, 2016 at 6:31 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> Here's my proposal:
>
> 1. You already implemented a function to find non-frozen tuples on
> supposedly all-frozen pages. Great.
>
> 2. Let's implement a second function to find dead tuples on supposedly
> all-visible pages.
>
I am planning to name them pg_check_frozen and pg_check_visible; let me know if you think something else would suit better.
On Thu, Jun 9, 2016 at 8:48 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Wed, Jun 8, 2016 at 6:31 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> >
> >
> > Here's my proposal:
> >
> > 1. You already implemented a function to find non-frozen tuples on
> > supposedly all-frozen pages. Great.
> >
> > 2. Let's implement a second function to find dead tuples on supposedly
> > all-visible pages.
> >
>
> I am planning to name them pg_check_frozen and pg_check_visible; let me know if you think something else would suit better.
>
Attached patch implements the above 2 functions. I have addressed the comments from Sawada-san and you in the latest patch and updated the documentation as well.
On Thu, Jun 9, 2016 at 5:48 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> Attached patch implements the above 2 functions. I have addressed the
> comments from Sawada-san and you in the latest patch and updated the
> documentation as well.

I made a number of changes to this patch. Here is the new version.

1. The algorithm you were using for growing the array size is unsafe
and can easily overrun the array. Suppose that each of the first two
pages has some corrupt tuples, more than 50% of MaxHeapTuplesPerPage
but less than the full value of MaxHeapTuplesPerPage. Your code will
conclude that the array does not need to be enlarged after processing
the first page. I switched this to what I consider the normal coding
pattern for such problems.

2. The all-visible checks seemed to me to be incorrect and incomplete.
I made the check match the logic in lazy_scan_heap.

3. Your 1.0 -> 1.1 upgrade script was missing copies of the REVOKE
statements you added to the 1.1 script. I added them.

4. The tests as written were not safe under concurrency; they could
return spurious results if the page changed between the time you
checked the visibility map and the time you actually examined the
tuples. I think people will try running these functions on live
systems, so I changed the code to recheck the VM bits after locking
the page. Unfortunately, there's either still a concurrency-related
problem here or there's a bug in the all-frozen code itself, because I
once managed to get pg_check_frozen('pgbench_accounts') to return a
TID while pgbench was running concurrently. That's a bit alarming, but
since I can't reproduce it I don't really have a clue how to track
down the problem.

5. I made various cosmetic improvements.

If there are no objections, I will go ahead and commit this tomorrow,
because even if there is a bug (see point #4 above) I think it's
better to have this in the tree than not. However, code review and/or
testing with these new functions seems like it would be an extremely
good idea.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
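The "normal coding pattern" in point 1 is presumably the test-before-every-append idiom, which can never overrun the array no matter how many corrupt items one page yields; a sketch, with the tids/count/allocated field names assumed:

	/* Grow on demand before each append; doubling amortizes the repallocs. */
	if (items->count >= items->allocated)
	{
		items->allocated *= 2;
		items->tids = (ItemPointer) repalloc(items->tids,
							items->allocated * sizeof(ItemPointerData));
	}
	items->tids[items->count++] = tuple.t_self;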
Hi Robert, Amit, thanks for working on this. On 2016-06-09 12:11:15 -0400, Robert Haas wrote: > 4. The tests as written were not safe under concurrency; they could > return spurious results if the page changed between the time you > checked the visibility map and the time you actually examined the > tuples. I think people will try running these functions on live > systems, so I changed the code to recheck the VM bits after locking > the page. Unfortunately, there's either still a concurrency-related > problem here or there's a bug in the all-frozen code itself because I > once managed to get pg_check_frozen('pgbench_accounts') to return a > TID while pgbench was running concurrently. That's a bit alarming, > but since I can't reproduce it I don't really have a clue how to track > down the problem. Ugh, that's a bit concerning. > If there are not objections, I will go ahead and commit this tomorrow, > because even if there is a bug (see point #4 above) I think it's > better to have this in the tree than not. However, code review and/or > testing with these new functions seems like it would be an extremely > good idea. I'll try to spend some time on that today (code review & testing). Andres
Hi,

I found a few relatively minor issues.

1) I think we should perform a relkind check in collect_corrupt_items().
Atm we'll "gladly" run against an index. If we actually entered the main
portion of the loop in collect_corrupt_items(), that could end up
corrupting the table (via HeapTupleSatisfiesVacuum()). But it's probably
safe, because the vm fork doesn't exist for anything but heap/toast
relations.

2) GetOldestXmin() currently specifies a relation, which can cause
trouble in recovery:

	/*
	 * If we're not computing a relation specific limit, or if a shared
	 * relation has been passed in, backends in all databases have to be
	 * considered.
	 */
	allDbs = rel == NULL || rel->rd_rel->relisshared;

	/* Cannot look for individual databases during recovery */
	Assert(allDbs || !RecoveryInProgress());

i.e. we'll Assert out. I think that needs to be fixed.

3) Harmless here, but I think it's bad policy to release locks on normal
relations before the end of xact.

+	relation_close(rel, AccessShareLock);
+

4)
+	if (check_visible)
+	{
+		HTSV_Result state;
+
+		state = HeapTupleSatisfiesVacuum(&tuple, OldestXmin, buffer);
+		if (state != HEAPTUPLE_LIVE ||
+			!HeapTupleHeaderXminCommitted(tuple.t_data))
+			record_corrupt_item(items, &tuple.t_data->t_ctid);
+		else

This theoretically could give false positives, if GetOldestXmin() went
backwards. But I think that's ok.

5) There's a bunch of whitespace damage in the diff, like

	Oid relid = PG_GETARG_OID(0);
-	MemoryContext oldcontext;
+	MemoryContext oldcontext;

Otherwise this looks good. I played with it for a while, and besides
finding intentionally caused corruption, it didn't flag anything
(besides crashing on a standby, as in 2)).

Greetings,

Andres Freund
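For issue 1), the relkind check would presumably reject anything that cannot have a heap and a VM fork; a sketch, with the error wording invented:

	/* Sketch: only relation kinds with a visibility map should be accepted. */
	if (rel->rd_rel->relkind != RELKIND_RELATION &&
		rel->rd_rel->relkind != RELKIND_MATVIEW &&
		rel->rd_rel->relkind != RELKIND_TOASTVALUE)
		ereport(ERROR,
				(errcode(ERRCODE_WRONG_OBJECT_TYPE),
				 errmsg("\"%s\" is not a table, materialized view, or TOAST table",
						RelationGetRelationName(rel))));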
On 2016-06-09 19:33:52 -0700, Andres Freund wrote:
> I played with it for a while, and besides
> finding intentionally caused corruption, it didn't flag anything
> (besides crashing on a standby, as in 2)).

Ugh. Just seconds after I sent that email:

oid | t_ctid
------------------+--------------
pgbench_accounts | (889641,33)
pgbench_accounts | (893854,56)
pgbench_accounts | (924226,13)
pgbench_accounts | (1073457,51)
pgbench_accounts | (1084904,16)
pgbench_accounts | (1111996,26)
(6 rows)

oid | t_ctid
-----+--------
(0 rows)

oid | t_ctid
------------------+--------------
pgbench_accounts | (739198,13)
pgbench_accounts | (887254,11)
pgbench_accounts | (1050391,6)
pgbench_accounts | (1158640,46)
pgbench_accounts | (1238067,18)
pgbench_accounts | (1273282,22)
pgbench_accounts | (1355816,54)
pgbench_accounts | (1361880,33)
(8 rows)

Seems to be correlated with a concurrent vacuum, but it's hard to tell,
because I didn't have psql output a timestamp.

Greetings,

Andres Freund
On Fri, Jun 10, 2016 at 8:08 AM, Andres Freund <andres@anarazel.de> wrote:
On 2016-06-09 19:33:52 -0700, Andres Freund wrote:
> I played with it for a while, and besides
> finding intentionally caused corruption, it didn't flag anything
> (besides crashing on a standby, as in 2)).
Ugh. Just seconds after I sent that email:
oid | t_ctid
------------------+--------------
pgbench_accounts | (889641,33)
pgbench_accounts | (893854,56)
pgbench_accounts | (924226,13)
pgbench_accounts | (1073457,51)
pgbench_accounts | (1084904,16)
pgbench_accounts | (1111996,26)
(6 rows)
oid | t_ctid
-----+--------
(0 rows)
oid | t_ctid
------------------+--------------
pgbench_accounts | (739198,13)
pgbench_accounts | (887254,11)
pgbench_accounts | (1050391,6)
pgbench_accounts | (1158640,46)
pgbench_accounts | (1238067,18)
pgbench_accounts | (1273282,22)
pgbench_accounts | (1355816,54)
pgbench_accounts | (1361880,33)
(8 rows)
Is this the output of pg_check_visible() or pg_check_frozen()?
On June 9, 2016 7:46:06 PM PDT, Amit Kapila <amit.kapila16@gmail.com> wrote:
>On Fri, Jun 10, 2016 at 8:08 AM, Andres Freund <andres@anarazel.de>
>wrote:
>
>> On 2016-06-09 19:33:52 -0700, Andres Freund wrote:
>> > I played with it for a while, and besides
>> > finding intentionally caused corruption, it didn't flag anything
>> > (besides crashing on a standby, as in 2)).
>>
>> Ugh. Just seconds after I sent that email:
>>
>> oid | t_ctid
>> ------------------+--------------
>> pgbench_accounts | (889641,33)
>> pgbench_accounts | (893854,56)
>> pgbench_accounts | (924226,13)
>> pgbench_accounts | (1073457,51)
>> pgbench_accounts | (1084904,16)
>> pgbench_accounts | (1111996,26)
>> (6 rows)
>>
>> oid | t_ctid
>> -----+--------
>> (0 rows)
>>
>> oid | t_ctid
>> ------------------+--------------
>> pgbench_accounts | (739198,13)
>> pgbench_accounts | (887254,11)
>> pgbench_accounts | (1050391,6)
>> pgbench_accounts | (1158640,46)
>> pgbench_accounts | (1238067,18)
>> pgbench_accounts | (1273282,22)
>> pgbench_accounts | (1355816,54)
>> pgbench_accounts | (1361880,33)
>> (8 rows)
>>
>Is this the output of pg_check_visible() or pg_check_frozen()?

Unfortunately I don't know. I was running a union of both, I didn't
really expect to hit an issue... I guess I'll put a PANIC in the
relevant places and check whether I can reproduce it.

Andres
--
Sent from my Android device with K-9 Mail. Please excuse my brevity.
On Fri, Jun 10, 2016 at 8:27 AM, Andres Freund <andres@anarazel.de> wrote:
Unfortunately I don't know. I was running a union of both, I didn't really expect to hit an issue... I guess I'll put a PANIC in the relevant places and check whether I can reproduce it.
On June 9, 2016 7:46:06 PM PDT, Amit Kapila <amit.kapila16@gmail.com> wrote:
>On Fri, Jun 10, 2016 at 8:08 AM, Andres Freund <andres@anarazel.de>
>wrote:
>
>> On 2016-06-09 19:33:52 -0700, Andres Freund wrote:
>> > I played with it for a while, and besides
>> > finding intentionally caused corruption, it didn't flag anything
>> > (besides crashing on a standby, as in 2)).
>>
>> Ugh. Just seconds after I sent that email:
>>
>> oid | t_ctid
>> ------------------+--------------
>> pgbench_accounts | (889641,33)
>> pgbench_accounts | (893854,56)
>> pgbench_accounts | (924226,13)
>> pgbench_accounts | (1073457,51)
>> pgbench_accounts | (1084904,16)
>> pgbench_accounts | (1111996,26)
>> (6 rows)
>>
>> oid | t_ctid
>> -----+--------
>> (0 rows)
>>
>> oid | t_ctid
>> ------------------+--------------
>> pgbench_accounts | (739198,13)
>> pgbench_accounts | (887254,11)
>> pgbench_accounts | (1050391,6)
>> pgbench_accounts | (1158640,46)
>> pgbench_accounts | (1238067,18)
>> pgbench_accounts | (1273282,22)
>> pgbench_accounts | (1355816,54)
>> pgbench_accounts | (1361880,33)
>> (8 rows)
>>
>>
>Is this output of pg_check_visible() or pg_check_frozen()?
I have tried multiple ways of running pgbench with read-write tests, but could not see any such behaviour. I have even tried crashing and restarting the server and then running pgbench again. Do you see these records on the master or the standby?
While looking at the code in this area, I observed that during replay of records (heap_xlog_delete), we first clear the vm and then update the page. So we don't hold the buffer lock while updating the vm, whereas in the patch (collect_corrupt_items()) we are relying on the fact that clearing a vm bit requires acquiring the buffer lock. Can that cause a problem?
On 2016-06-10 11:58:26 +0530, Amit Kapila wrote: > I have tried in multiple ways by running pgbench with read-write tests, but > could not see any such behaviour. It took over an hour of pgbench on a fast laptop till I saw it. > I have tried by even crashing and > restarting the server and then again running pgbench. Do you see these > records on master or slave? Master, but with an existing standby. So it could be related to hot_standby_feedback or such. > While looking at code in this area, I observed that during replay of > records (heap_xlog_delete), we first clear the vm, then update the page. > So we don't have Buffer lock while updating the vm where as in the patch > (collect_corrupt_items()), we are relying on the fact that for clearing vm > bit one needs to acquire buffer lock. Can that cause a problem? Unsetting a vm bit is always safe, right? The invariant is that the VM may never falsely say all_visible/frozen, but it's perfectly ok for a page to be all_visible/frozen, without the VM bit being present. Andres
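Stated in code, the invariant is one-directional: only a set bit carries a promise. A sketch, where page_all_visible() and record_corrupt_block() are hypothetical helpers (the former standing for an actual scan of the page's tuples):

	/*
	 * A set VM bit must be backed by the heap page; a clear bit promises
	 * nothing, so an all-visible page with a clear VM bit is not corrupt.
	 */
	if (VM_ALL_VISIBLE(rel, blkno, &vmbuffer) && !page_all_visible(rel, blkno))
		record_corrupt_block(blkno);	/* genuine corruption */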
On Thu, Jun 9, 2016 at 9:41 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>
> 2. The all-visible checks seemed to me to be incorrect and incomplete.
> I made the check match the logic in lazy_scan_heap.
>
Okay, I thought we just wanted to check for dead tuples. If we want logic similar to lazy_scan_heap(), then I think we should also consider applying the snapshot-too-old threshold limit to OldestXmin. We currently do that in vacuum_set_xid_limits() for VACUUM. Is there a reason not to consider it for the visibility check function?
On Fri, Jun 10, 2016 at 1:11 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Thu, Jun 9, 2016 at 5:48 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> Attached patch implements the above 2 functions. I have addressed the
>> comments from Sawada-san and you in the latest patch and updated the
>> documentation as well.
>
> I made a number of changes to this patch. Here is the new version.
>
> 1. The algorithm you were using for growing the array size is unsafe
> and can easily overrun the array. Suppose that each of the first two
> pages has some corrupt tuples, more than 50% of MaxHeapTuplesPerPage
> but less than the full value of MaxHeapTuplesPerPage. Your code will
> conclude that the array does not need to be enlarged after processing
> the first page. I switched this to what I consider the normal coding
> pattern for such problems.
>
> 2. The all-visible checks seemed to me to be incorrect and incomplete.
> I made the check match the logic in lazy_scan_heap.
>
> 3. Your 1.0 -> 1.1 upgrade script was missing copies of the REVOKE
> statements you added to the 1.1 script. I added them.
>
> 4. The tests as written were not safe under concurrency; they could
> return spurious results if the page changed between the time you
> checked the visibility map and the time you actually examined the
> tuples. I think people will try running these functions on live
> systems, so I changed the code to recheck the VM bits after locking
> the page. Unfortunately, there's either still a concurrency-related
> problem here or there's a bug in the all-frozen code itself, because I
> once managed to get pg_check_frozen('pgbench_accounts') to return a
> TID while pgbench was running concurrently. That's a bit alarming, but
> since I can't reproduce it I don't really have a clue how to track
> down the problem.
>
> 5. I made various cosmetic improvements.
>
> If there are no objections, I will go ahead and commit this tomorrow,
> because even if there is a bug (see point #4 above) I think it's
> better to have this in the tree than not. However, code review and/or
> testing with these new functions seems like it would be an extremely
> good idea.

Thank you for working on this. Here are some minor comments.

---
+/*
+ * Return the TIDs of not-all-visible tuples in pages marked all-visible

If there is even one non-visible tuple in pages marked all-visible, the
database might be corrupted. Is it better "not-visible" or "non-visible"
instead of "not-all-visible"?

---
Do we need to check the page header flag? I think the database might
also be corrupt in the case where there is a non-visible tuple on a page
with PD_ALL_VISIBLE set. We could emit a WARNING in such a case.

Also, using the attached tool, which allows us to set spurious
visibility map status without actually modifying the tuple, I manually
created some situations where the database is corrupted and tested it;
ISTM that it works fine. I'm not proposing the tool as a new feature, of
course, but please use it as appropriate.

Regards,

--
Masahiko Sawada
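Sawada's page-header point could be covered by a check along these lines; a sketch only, where all_tuples_visible stands for the outcome of the per-tuple scan the patch already performs:

	/*
	 * PD_ALL_VISIBLE makes the same promise as the VM's all-visible bit,
	 * so a page with the flag set but a non-visible tuple is equally
	 * corrupt, even if the VM bit happens to be clear.
	 */
	if (PageIsAllVisible(page) && !all_tuples_visible)
		elog(WARNING, "page %u of relation \"%s\" has PD_ALL_VISIBLE set but contains non-visible tuples",
			 blkno, RelationGetRelationName(rel));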
On Fri, Jun 10, 2016 at 12:09 PM, Andres Freund <andres@anarazel.de> wrote:
>
> On 2016-06-10 11:58:26 +0530, Amit Kapila wrote:
>
>
> > While looking at code in this area, I observed that during replay of
> > records (heap_xlog_delete), we first clear the vm, then update the page.
> > So we don't have Buffer lock while updating the vm where as in the patch
> > (collect_corrupt_items()), we are relying on the fact that for clearing vm
> > bit one needs to acquire buffer lock. Can that cause a problem?
>
> Unsetting a vm bit is always safe, right?
I think so, which means this should not be a problem area.
On 2016-06-09 23:39:24 -0700, Andres Freund wrote:
> On 2016-06-10 11:58:26 +0530, Amit Kapila wrote:
> > I have tried in multiple ways by running pgbench with read-write tests, but
> > could not see any such behaviour.
>
> It took over an hour of pgbench on a fast laptop till I saw it.
>
> > I have tried by even crashing and
> > restarting the server and then again running pgbench. Do you see these
> > records on master or slave?
>
> Master, but with an existing standby. So it could be related to
> hot_standby_feedback or such.

I just managed to trigger it again.

#1  0x00007fa1a73778da in __GI_abort () at abort.c:89
#2  0x00007f9f1395e59c in record_corrupt_item (items=items@entry=0x2137be0, tid=0x7f9fb8681c0c)
    at /home/andres/src/postgresql/contrib/pg_visibility/pg_visibility.c:612
#3  0x00007f9f1395ead5 in collect_corrupt_items (relid=relid@entry=29449, all_visible=all_visible@entry=0 '\000', all_frozen=all_frozen@entry=1 '\001')
    at /home/andres/src/postgresql/contrib/pg_visibility/pg_visibility.c:572
#4  0x00007f9f1395f476 in pg_check_frozen (fcinfo=0x7ffe5343a200) at /home/andres/src/postgresql/contrib/pg_visibility/pg_visibility.c:292
#5  0x00000000005fdbec in ExecMakeTableFunctionResult (funcexpr=0x2168630, econtext=0x2168320, argContext=<optimized out>, expectedDesc=0x2168ef0, randomAccess=0 '\000')
    at /home/andres/src/postgresql/src/backend/executor/execQual.c:2211
#6  0x0000000000616992 in FunctionNext (node=node@entry=0x2168210) at /home/andres/src/postgresql/src/backend/executor/nodeFunctionscan.c:94
#7  0x00000000005ffdcb in ExecScanFetch (recheckMtd=0x6166f0 <FunctionRecheck>, accessMtd=0x616700 <FunctionNext>, node=0x2168210)
    at /home/andres/src/postgresql/src/backend/executor/execScan.c:95
#8  ExecScan (node=node@entry=0x2168210, accessMtd=accessMtd@entry=0x616700 <FunctionNext>, recheckMtd=recheckMtd@entry=0x6166f0 <FunctionRecheck>)
    at /home/andres/src/postgresql/src/backend/executor/execScan.c:145
#9  0x00000000006169e4 in ExecFunctionScan (node=node@entry=0x2168210) at /home/andres/src/postgresql/src/backend/executor/nodeFunctionscan.c:268

the error happened just after I restarted a standby, so it's not
unlikely to be related to hot_standby_feedback.

(gdb) p *tuple.t_data
$5 = {t_choice = {t_heap = {t_xmin = 9105470, t_xmax = 26049273, t_field3 = {t_cid = 0, t_xvac = 0}},
  t_datum = {datum_len_ = 9105470, datum_typmod = 26049273, datum_typeid = 0}},
  t_ctid = {ip_blkid = {bi_hi = 1, bi_lo = 19765}, ip_posid = 3},
  t_infomask2 = 4, t_infomask = 770, t_hoff = 24 '\030', t_bits = 0x7f9fb8681c17 ""}

Infomask is:
#define HEAP_XMIN_COMMITTED	0x0100	/* t_xmin committed */
#define HEAP_XMIN_INVALID	0x0200	/* t_xmin invalid/aborted */
#define HEAP_XMIN_FROZEN	(HEAP_XMIN_COMMITTED|HEAP_XMIN_INVALID)
#define HEAP_HASVARWIDTH	0x0002	/* has variable-width attribute(s) */

This indeed looks borked. Such a tuple should never survive

	if (check_frozen && !VM_ALL_FROZEN(rel, blkno, &vmbuffer))
		check_frozen = false;

especially not when

(gdb) p PageIsAllVisible(page)
$3 = 4

(fwiw, checking PD_ALL_VISIBLE in those functions sounds like a good plan)

I've got another earlier case (that I somehow missed seeing), below
check_visible:

(gdb) p *tuple->t_data
$2 = {t_choice = {t_heap = {t_xmin = 13616549, t_xmax = 25210801, t_field3 = {t_cid = 0, t_xvac = 0}},
  t_datum = {datum_len_ = 13616549, datum_typmod = 25210801, datum_typeid = 0}},
  t_ctid = {ip_blkid = {bi_hi = 0, bi_lo = 52320}, ip_posid = 67},
  t_infomask2 = 32772, t_infomask = 8962, t_hoff = 24 '\030', t_bits = 0x7f9fda2f8717 ""}

infomask is:
#define HEAP_UPDATED		0x2000	/* this is UPDATEd version of row */
#define HEAP_XMIN_COMMITTED	0x0100	/* t_xmin committed */
#define HEAP_XMIN_INVALID	0x0200	/* t_xmin invalid/aborted */
#define HEAP_HASVARWIDTH	0x0002	/* has variable-width attribute(s) */

infomask2 is:
#define HEAP_ONLY_TUPLE		0x8000	/* this is heap-only tuple */

I'll run again, with a debugger attached, maybe I can get some more
information.

Regards,

Andres
On Fri, Jun 10, 2016 at 1:59 PM, Andres Freund <andres@anarazel.de> wrote: >> Master, but with an existing standby. So it could be related to >> hot_standby_feedback or such. > > I just managed to trigger it again. > > > #1 0x00007fa1a73778da in __GI_abort () at abort.c:89 > #2 0x00007f9f1395e59c in record_corrupt_item (items=items@entry=0x2137be0, tid=0x7f9fb8681c0c) > at /home/andres/src/postgresql/contrib/pg_visibility/pg_visibility.c:612 > #3 0x00007f9f1395ead5 in collect_corrupt_items (relid=relid@entry=29449, all_visible=all_visible@entry=0 '\000', all_frozen=all_frozen@entry=1'\001') > at /home/andres/src/postgresql/contrib/pg_visibility/pg_visibility.c:572 > #4 0x00007f9f1395f476 in pg_check_frozen (fcinfo=0x7ffe5343a200) at /home/andres/src/postgresql/contrib/pg_visibility/pg_visibility.c:292 > #5 0x00000000005fdbec in ExecMakeTableFunctionResult (funcexpr=0x2168630, econtext=0x2168320, argContext=<optimized out>,expectedDesc=0x2168ef0, > randomAccess=0 '\000') at /home/andres/src/postgresql/src/backend/executor/execQual.c:2211 > #6 0x0000000000616992 in FunctionNext (node=node@entry=0x2168210) at /home/andres/src/postgresql/src/backend/executor/nodeFunctionscan.c:94 > #7 0x00000000005ffdcb in ExecScanFetch (recheckMtd=0x6166f0 <FunctionRecheck>, accessMtd=0x616700 <FunctionNext>, node=0x2168210) > at /home/andres/src/postgresql/src/backend/executor/execScan.c:95 > #8 ExecScan (node=node@entry=0x2168210, accessMtd=accessMtd@entry=0x616700 <FunctionNext>, recheckMtd=recheckMtd@entry=0x6166f0<FunctionRecheck>) > at /home/andres/src/postgresql/src/backend/executor/execScan.c:145 > #9 0x00000000006169e4 in ExecFunctionScan (node=node@entry=0x2168210) at /home/andres/src/postgresql/src/backend/executor/nodeFunctionscan.c:268 > > the error happened just after I restarted a standby, so it's not > unlikely to be related to hot_standby_feedback. After some off-list discussion and debugging, Andres and I have managed to identify three issues here (so far). Two are issues in the testing, and one is a data-corrupting bug in the freeze map code. 1. pg_check_visible keeps on using the same OldestXmin for all its checks even though the real OldestXmin may advance in the meantime. This can lead to spurious problem reports: pg_check_visible() thinks that the tuple isn't all visible yet and reports it as corruption, but in reality there's no problem. 2. pg_check_visible includes the same check for heap-xmin-committed that vacuumlazy.c uses, but hint bits aren't crash safe, so this could lead to a spurious trouble report in a scenario involving a crash. 3. vacuumlazy.c includes this code: if (heap_prepare_freeze_tuple(tuple.t_data, FreezeLimit, MultiXactCutoff,&frozen[nfrozen])) frozen[nfrozen++].offset = offnum; else if (heap_tuple_needs_eventual_freeze(tuple.t_data)) all_frozen = false; That's wrong, because a "true" return value from heap_prepare_freeze_tuple() means only that it has done *some* freezing work on the tuple, not that it's done all of the freezing work that will ever need to be done. So, if the tuple's xmin can be frozen and is aborted but not older than vacuum_freeze_min_age, then heap_prepare_freeze_tuple() won't free xmax, but the page will still be marked all-frozen, which is bad. I think it normally won't matter because the xmax will probably be hinted invalid anyway, since we just pruned the page which should have set hint bits everywhere, but if those hint bits were lost then we'd eventually end up with an accessible xmax pointing off into space. 
My first thought was to just delete the "else" but that would be bad because we'd fail to set all-frozen immediately in a lot of cases where we should. This needs a bit more thought than I have time to give it right now. (I will update on the status of this open item again no later than Monday; probably sooner.) -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Robert Haas wrote: > 3. vacuumlazy.c includes this code: > > if (heap_prepare_freeze_tuple(tuple.t_data, FreezeLimit, > MultiXactCutoff, &frozen[nfrozen])) > frozen[nfrozen++].offset = offnum; > else if (heap_tuple_needs_eventual_freeze(tuple.t_data)) > all_frozen = false; > > That's wrong, because a "true" return value from > heap_prepare_freeze_tuple() means only that it has done *some* > freezing work on the tuple, not that it's done all of the freezing > work that will ever need to be done. So, if the tuple's xmin can be > frozen and is aborted but not older than vacuum_freeze_min_age, then > heap_prepare_freeze_tuple() won't free xmax, but the page will still > be marked all-frozen, which is bad. I think it normally won't matter > because the xmax will probably be hinted invalid anyway, since we just > pruned the page which should have set hint bits everywhere, but if > those hint bits were lost then we'd eventually end up with an > accessible xmax pointing off into space. Good catch. Also consider multixact freezing: if there is a long-running transaction which is a lock-only member of tuple's Xmax, and the multixact needs freezing because it's older than the multixact cutoff, we set the xmax to a new multixact which includes that old locker. See FreezeMultiXactId. > My first thought was to just delete the "else" but that would be bad > because we'd fail to set all-frozen immediately in a lot of cases > where we should. This needs a bit more thought than I have time to > give it right now. How about changing the return tuple of heap_prepare_freeze_tuple to a bitmap? Two flags: "Freeze [not] done" and "[No] more freezing needed" -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Fri, Jun 10, 2016 at 4:55 PM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote: > Robert Haas wrote: >> 3. vacuumlazy.c includes this code: >> >> if (heap_prepare_freeze_tuple(tuple.t_data, FreezeLimit, >> MultiXactCutoff, &frozen[nfrozen])) >> frozen[nfrozen++].offset = offnum; >> else if (heap_tuple_needs_eventual_freeze(tuple.t_data)) >> all_frozen = false; >> >> That's wrong, because a "true" return value from >> heap_prepare_freeze_tuple() means only that it has done *some* >> freezing work on the tuple, not that it's done all of the freezing >> work that will ever need to be done. So, if the tuple's xmin can be >> frozen and is aborted but not older than vacuum_freeze_min_age, then >> heap_prepare_freeze_tuple() won't free xmax, but the page will still >> be marked all-frozen, which is bad. I think it normally won't matter >> because the xmax will probably be hinted invalid anyway, since we just >> pruned the page which should have set hint bits everywhere, but if >> those hint bits were lost then we'd eventually end up with an >> accessible xmax pointing off into space. > > Good catch. Also consider multixact freezing: if there is a > long-running transaction which is a lock-only member of tuple's Xmax, > and the multixact needs freezing because it's older than the multixact > cutoff, we set the xmax to a new multixact which includes that old > locker. See FreezeMultiXactId. > >> My first thought was to just delete the "else" but that would be bad >> because we'd fail to set all-frozen immediately in a lot of cases >> where we should. This needs a bit more thought than I have time to >> give it right now. > > How about changing the return tuple of heap_prepare_freeze_tuple to > a bitmap? Two flags: "Freeze [not] done" and "[No] more freezing > needed" Yes, I think something like that sounds about right. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Sat, Jun 11, 2016 at 1:24 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>
> 3. vacuumlazy.c includes this code:
>
> if (heap_prepare_freeze_tuple(tuple.t_data, FreezeLimit,
> MultiXactCutoff, &frozen[nfrozen]))
> frozen[nfrozen++].offset = offnum;
> else if (heap_tuple_needs_eventual_freeze(tuple.t_data))
> all_frozen = false;
>
> That's wrong, because a "true" return value from
> heap_prepare_freeze_tuple() means only that it has done *some*
> freezing work on the tuple, not that it's done all of the freezing
> work that will ever need to be done. So, if the tuple's xmin can be
> frozen and is aborted but not older than vacuum_freeze_min_age, then
> heap_prepare_freeze_tuple() won't free xmax, but the page will still
> be marked all-frozen, which is bad.
To clarify, are you talking about a case where the insertion has aborted? Won't the all_visible flag be set to false in such a case, due to the return value from HeapTupleSatisfiesVacuum(), and if so, shouldn't the later code refrain from marking the page as all_frozen?
On Sat, Jun 11, 2016 at 5:00 PM, Robert Haas <robertmhaas@gmail.com> wrote: >> How about changing the return tuple of heap_prepare_freeze_tuple to >> a bitmap? Two flags: "Freeze [not] done" and "[No] more freezing >> needed" > > Yes, I think something like that sounds about right. Here's a patch. I took the approach of adding a separate bool out parameter instead. I am also attaching an update of the check-visibility patch which responds to assorted review comments and adjusting it for the problems found on Friday which could otherwise lead to false positives. I'm still getting occasional TIDs from the pg_check_visible() function during pgbench runs, though, so evidently not all is well with the world. (Official status update: I'm hoping that senior hackers will carefully review these patches for defects. If they do not, I plan to commit the patches anyway neither less than 48 nor more than 60 hours from now after re-reviewing them myself.) -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
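With the separate bool out parameter, the problematic vacuumlazy.c fragment quoted upthread would take roughly this shape; a sketch of the approach rather than the committed diff, with the parameter name illustrative:

	bool		tuple_totally_frozen;

	/*
	 * The return value still means "some freezing was performed"; the new
	 * out parameter separately reports whether the tuple will never need
	 * any further freezing.
	 */
	if (heap_prepare_freeze_tuple(tuple.t_data, FreezeLimit,
								  MultiXactCutoff, &frozen[nfrozen],
								  &tuple_totally_frozen))
		frozen[nfrozen++].offset = offnum;

	if (!tuple_totally_frozen)
		all_frozen = false;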
On June 13, 2016 11:02:42 AM CDT, Robert Haas <robertmhaas@gmail.com> wrote: >(Official status update: I'm hoping that senior hackers will carefully >review these patches for defects. If they do not, I plan to commit >the patches anyway neither less than 48 nor more than 60 hours from >now after re-reviewing them myself.) I'm traveling today and tomorrow, but will look after that. Andres -- Sent from my Android device with K-9 Mail. Please excuse my brevity.
On Tue, Jun 14, 2016 at 4:02 AM, Robert Haas <robertmhaas@gmail.com> wrote: > On Sat, Jun 11, 2016 at 5:00 PM, Robert Haas <robertmhaas@gmail.com> wrote: >>> How about changing the return tuple of heap_prepare_freeze_tuple to >>> a bitmap? Two flags: "Freeze [not] done" and "[No] more freezing >>> needed" >> >> Yes, I think something like that sounds about right. > > Here's a patch. I took the approach of adding a separate bool out > parameter instead. I am also attaching an update of the > check-visibility patch which responds to assorted review comments and > adjusting it for the problems found on Friday which could otherwise > lead to false positives. I'm still getting occasional TIDs from the > pg_check_visible() function during pgbench runs, though, so evidently > not all is well with the world. I'm still working out how half this stuff works, but I managed to get pg_check_visible() to spit out a row every few seconds with the following brute force approach: CREATE TABLE foo (n int); INSERT INTO foo SELECT generate_series(1, 100000); Three client threads (see attached script): 1. Run VACUUM in a tight loop. 2. Run UPDATE foo SET n = n + 1 in a tight loop. 3. Run SELECT pg_check_visible('foo'::regclass) in a tight loop, and print out any rows it produces. I noticed that the tuples that it reported were always offset 1 in a page, and that the page always had a maxoff over a couple of hundred, and that we called record_corrupt_item because VM_ALL_VISIBLE returned true but HeapTupleSatisfiesVacuum on the first tuple returned HEAPTUPLE_DELETE_IN_PROGRESS instead of the expected HEAPTUPLE_LIVE. It did that because HEAP_XMAX_COMMITTED was not set and TransactionIdIsInProgress returned true for xmax. Not sure how much of this was already obvious! I will poke at it some more tomorrow. -- Thomas Munro http://www.enterprisedb.com
On Tue, Jun 14, 2016 at 2:53 AM, Thomas Munro <thomas.munro@enterprisedb.com> wrote: > On Tue, Jun 14, 2016 at 4:02 AM, Robert Haas <robertmhaas@gmail.com> wrote: >> On Sat, Jun 11, 2016 at 5:00 PM, Robert Haas <robertmhaas@gmail.com> wrote: >>>> How about changing the return tuple of heap_prepare_freeze_tuple to >>>> a bitmap? Two flags: "Freeze [not] done" and "[No] more freezing >>>> needed" >>> >>> Yes, I think something like that sounds about right. >> >> Here's a patch. I took the approach of adding a separate bool out >> parameter instead. I am also attaching an update of the >> check-visibility patch which responds to assorted review comments and >> adjusting it for the problems found on Friday which could otherwise >> lead to false positives. I'm still getting occasional TIDs from the >> pg_check_visible() function during pgbench runs, though, so evidently >> not all is well with the world. > > I'm still working out how half this stuff works, but I managed to get > pg_check_visible() to spit out a row every few seconds with the > following brute force approach: > > CREATE TABLE foo (n int); > INSERT INTO foo SELECT generate_series(1, 100000); > > Three client threads (see attached script): > 1. Run VACUUM in a tight loop. > 2. Run UPDATE foo SET n = n + 1 in a tight loop. > 3. Run SELECT pg_check_visible('foo'::regclass) in a tight loop, and > print out any rows it produces. > > I noticed that the tuples that it reported were always offset 1 in a > page, and that the page always had a maxoff over a couple of hundred, > and that we called record_corrupt_item because VM_ALL_VISIBLE returned > true but HeapTupleSatisfiesVacuum on the first tuple returned > HEAPTUPLE_DELETE_IN_PROGRESS instead of the expected HEAPTUPLE_LIVE. > It did that because HEAP_XMAX_COMMITTED was not set and > TransactionIdIsInProgress returned true for xmax. So this seems like it might be a visibility map bug rather than a bug in the test code, but I'm not completely sure of that. How was it legitimate to mark the page as all-visible if a tuple on the page still had a live xmax? If xmax is live and not just a locker then the tuple is not visible to the transaction that wrote xmax, at least. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Tue, Jun 14, 2016 at 8:08 AM, Robert Haas <robertmhaas@gmail.com> wrote: > On Tue, Jun 14, 2016 at 2:53 AM, Thomas Munro > <thomas.munro@enterprisedb.com> wrote: >> On Tue, Jun 14, 2016 at 4:02 AM, Robert Haas <robertmhaas@gmail.com> wrote: >>> On Sat, Jun 11, 2016 at 5:00 PM, Robert Haas <robertmhaas@gmail.com> wrote: >>>>> How about changing the return tuple of heap_prepare_freeze_tuple to >>>>> a bitmap? Two flags: "Freeze [not] done" and "[No] more freezing >>>>> needed" >>>> >>>> Yes, I think something like that sounds about right. >>> >>> Here's a patch. I took the approach of adding a separate bool out >>> parameter instead. I am also attaching an update of the >>> check-visibility patch which responds to assorted review comments and >>> adjusting it for the problems found on Friday which could otherwise >>> lead to false positives. I'm still getting occasional TIDs from the >>> pg_check_visible() function during pgbench runs, though, so evidently >>> not all is well with the world. >> >> I'm still working out how half this stuff works, but I managed to get >> pg_check_visible() to spit out a row every few seconds with the >> following brute force approach: >> >> CREATE TABLE foo (n int); >> INSERT INTO foo SELECT generate_series(1, 100000); >> >> Three client threads (see attached script): >> 1. Run VACUUM in a tight loop. >> 2. Run UPDATE foo SET n = n + 1 in a tight loop. >> 3. Run SELECT pg_check_visible('foo'::regclass) in a tight loop, and >> print out any rows it produces. >> >> I noticed that the tuples that it reported were always offset 1 in a >> page, and that the page always had a maxoff over a couple of hundred, >> and that we called record_corrupt_item because VM_ALL_VISIBLE returned >> true but HeapTupleSatisfiesVacuum on the first tuple returned >> HEAPTUPLE_DELETE_IN_PROGRESS instead of the expected HEAPTUPLE_LIVE. >> It did that because HEAP_XMAX_COMMITTED was not set and >> TransactionIdIsInProgress returned true for xmax. > > So this seems like it might be a visibility map bug rather than a bug > in the test code, but I'm not completely sure of that. How was it > legitimate to mark the page as all-visible if a tuple on the page > still had a live xmax? If xmax is live and not just a locker then the > tuple is not visible to the transaction that wrote xmax, at least. Ah, wait a minute. I see how this could happen. Hang on, let me update the pg_visibility patch. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Tue, Jun 14, 2016 at 8:11 AM, Robert Haas <robertmhaas@gmail.com> wrote: >>> I noticed that the tuples that it reported were always offset 1 in a >>> page, and that the page always had a maxoff over a couple of hundred, >>> and that we called record_corrupt_item because VM_ALL_VISIBLE returned >>> true but HeapTupleSatisfiesVacuum on the first tuple returned >>> HEAPTUPLE_DELETE_IN_PROGRESS instead of the expected HEAPTUPLE_LIVE. >>> It did that because HEAP_XMAX_COMMITTED was not set and >>> TransactionIdIsInProgress returned true for xmax. >> >> So this seems like it might be a visibility map bug rather than a bug >> in the test code, but I'm not completely sure of that. How was it >> legitimate to mark the page as all-visible if a tuple on the page >> still had a live xmax? If xmax is live and not just a locker then the >> tuple is not visible to the transaction that wrote xmax, at least. > > Ah, wait a minute. I see how this could happen. Hang on, let me > update the pg_visibility patch. The problem should be fixed in the attached revision of pg_check_visible. I think what happened is: 1. pg_check_visible computed an OldestXmin. 2. Some transaction committed. 3. VACUUM computed a newer OldestXmin and marked a page all-visible with it. 4. pg_check_visible then used its older OldestXmin to check the visibility of tuples on that page, and saw delete-in-progress as a result. I added a guard against a similar scenario involving xmin in the last version of this patch, but forgot that we need to protect xmax in the same way. With this version of the patch, I can no longer get any TIDs to pop up out of pg_check_visible in my testing. (I haven't run your test script for lack of the proper Python environment...) -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Attachment
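For illustration, the recheck Robert describes might look roughly like this (a minimal sketch, not the actual pg_visibility code; record_corrupt_item is the helper mentioned upthread, with an assumed signature):

/*
 * Sketch: if a tuple on an all-visible page is not HEAPTUPLE_LIVE under
 * the OldestXmin computed at the start of the scan, recompute the horizon
 * and retest before reporting corruption, since VACUUM may have marked
 * the page all-visible using a newer horizon.
 */
if (HeapTupleSatisfiesVacuum(&tuple, OldestXmin, buffer) != HEAPTUPLE_LIVE)
{
    TransactionId RecomputedOldestXmin = GetOldestXmin(NULL, true);

    if (!TransactionIdPrecedes(OldestXmin, RecomputedOldestXmin))
        record_corrupt_item(items, &tuple.t_self);  /* horizon unchanged */
    else
    {
        OldestXmin = RecomputedOldestXmin;
        if (HeapTupleSatisfiesVacuum(&tuple, OldestXmin, buffer)
            != HEAPTUPLE_LIVE)
            record_corrupt_item(items, &tuple.t_self);
    }
}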
On Wed, Jun 15, 2016 at 12:44 AM, Robert Haas <robertmhaas@gmail.com> wrote: > On Tue, Jun 14, 2016 at 8:11 AM, Robert Haas <robertmhaas@gmail.com> wrote: >>>> I noticed that the tuples that it reported were always offset 1 in a >>>> page, and that the page always had a maxoff over a couple of hundred, >>>> and that we called record_corrupt_item because VM_ALL_VISIBLE returned >>>> true but HeapTupleSatisfiesVacuum on the first tuple returned >>>> HEAPTUPLE_DELETE_IN_PROGRESS instead of the expected HEAPTUPLE_LIVE. >>>> It did that because HEAP_XMAX_COMMITTED was not set and >>>> TransactionIdIsInProgress returned true for xmax. >>> >>> So this seems like it might be a visibility map bug rather than a bug >>> in the test code, but I'm not completely sure of that. How was it >>> legitimate to mark the page as all-visible if a tuple on the page >>> still had a live xmax? If xmax is live and not just a locker then the >>> tuple is not visible to the transaction that wrote xmax, at least. >> >> Ah, wait a minute. I see how this could happen. Hang on, let me >> update the pg_visibility patch. > > The problem should be fixed in the attached revision of > pg_check_visible. I think what happened is: > > 1. pg_check_visible computed an OldestXmin. > 2. Some transaction committed. > 3. VACUUM computed a newer OldestXmin and marked a page all-visible with it. > 4. pg_check_visible then used its older OldestXmin to check the > visibility of tuples on that page, and saw delete-in-progress as a > result. > > I added a guard against a similar scenario involving xmin in the last > version of this patch, but forgot that we need to protect xmax in the > same way. With this version of the patch, I can no longer get any > TIDs to pop up out of pg_check_visible in my testing. (I haven't run > your test script for lack of the proper Python environment...) I can still reproduce the problem with this new patch. What I see is that the OldestXmin, the new RecomputedOldestXmin and the tuple's xmax are all the same. -- Thomas Munro http://www.enterprisedb.com
On Wed, Jun 15, 2016 at 11:43 AM, Thomas Munro <thomas.munro@enterprisedb.com> wrote: > On Wed, Jun 15, 2016 at 12:44 AM, Robert Haas <robertmhaas@gmail.com> wrote: >> On Tue, Jun 14, 2016 at 8:11 AM, Robert Haas <robertmhaas@gmail.com> wrote: >>>>> I noticed that the tuples that it reported were always offset 1 in a >>>>> page, and that the page always had a maxoff over a couple of hundred, >>>>> and that we called record_corrupt_item because VM_ALL_VISIBLE returned >>>>> true but HeapTupleSatisfiesVacuum on the first tuple returned >>>>> HEAPTUPLE_DELETE_IN_PROGRESS instead of the expected HEAPTUPLE_LIVE. >>>>> It did that because HEAP_XMAX_COMMITTED was not set and >>>>> TransactionIdIsInProgress returned true for xmax. >>>> >>>> So this seems like it might be a visibility map bug rather than a bug >>>> in the test code, but I'm not completely sure of that. How was it >>>> legitimate to mark the page as all-visible if a tuple on the page >>>> still had a live xmax? If xmax is live and not just a locker then the >>>> tuple is not visible to the transaction that wrote xmax, at least. >>> >>> Ah, wait a minute. I see how this could happen. Hang on, let me >>> update the pg_visibility patch. >> >> The problem should be fixed in the attached revision of >> pg_check_visible. I think what happened is: >> >> 1. pg_check_visible computed an OldestXmin. >> 2. Some transaction committed. >> 3. VACUUM computed a newer OldestXmin and marked a page all-visible with it. >> 4. pg_check_visible then used its older OldestXmin to check the >> visibility of tuples on that page, and saw delete-in-progress as a >> result. >> >> I added a guard against a similar scenario involving xmin in the last >> version of this patch, but forgot that we need to protect xmax in the >> same way. With this version of the patch, I can no longer get any >> TIDs to pop up out of pg_check_visible in my testing. (I haven't run >> your test script for lack of the proper Python environment...) > > I can still reproduce the problem with this new patch. What I see is > that the OldestXmin, the new RecomputedOldestXmin and the tuple's xmax > are all the same. I spent some time chasing down the exact circumstances. I suspect that there may be an interlocking problem in heap_update. Using the line numbers from cae1c788 [1], I see the following interaction between the VACUUM, UPDATE and SELECT (pg_check_visible) backends, all in reference to the same block number:

[VACUUM] sets all visible bit

[UPDATE] heapam.c:3931 HeapTupleHeaderSetXmax(oldtup.t_data, xmax_old_tuple);
[UPDATE] heapam.c:3938 LockBuffer(buffer, BUFFER_LOCK_UNLOCK);

[SELECT] LockBuffer(buffer, BUFFER_LOCK_SHARE);
[SELECT] observes VM_ALL_VISIBLE as true
[SELECT] observes tuple in HEAPTUPLE_DELETE_IN_PROGRESS state
[SELECT] barfs

[UPDATE] heapam.c:4116 visibilitymap_clear(...)

[1] https://git.postgresql.org/gitweb/?p=postgresql.git;a=blob;f=src/backend/access/heap/heapam.c;hb=cae1c788b9b43887e4a4fa51a11c3a8ffa334939 -- Thomas Munro http://www.enterprisedb.com
On Wed, Jun 15, 2016 at 2:41 AM, Thomas Munro <thomas.munro@enterprisedb.com> wrote: > I spent some time chasing down the exact circumstances. I suspect > that there may be an interlocking problem in heap_update. Using the > line numbers from cae1c788 [1], I see the following interaction > between the VACUUM, UPDATE and SELECT (pg_check_visible) backends, all > in reference to the same block number: > > [VACUUM] sets all visible bit > > [UPDATE] heapam.c:3931 HeapTupleHeaderSetXmax(oldtup.t_data, xmax_old_tuple); > [UPDATE] heapam.c:3938 LockBuffer(buffer, BUFFER_LOCK_UNLOCK); > > [SELECT] LockBuffer(buffer, BUFFER_LOCK_SHARE); > [SELECT] observes VM_ALL_VISIBLE as true > [SELECT] observes tuple in HEAPTUPLE_DELETE_IN_PROGRESS state > [SELECT] barfs > > [UPDATE] heapam.c:4116 visibilitymap_clear(...) Yikes: heap_update() sets the tuple's XMAX, CMAX, infomask, infomask2, and CTID without logging anything or clearing the all-visible flag and then releases the lock on the heap page to go do some more work that might even ERROR out. Only if that other work all goes OK do we relock the page and perform the WAL-logged actions. That doesn't seem like a good idea even in existing releases, because you've taken a tuple on an all-visible page and made it not all-visible, and you've made a page modification that is not necessarily atomic without logging it. This is particularly bad in 9.6, because if that page is also all-frozen then XMAX will eventually be pointing into space and VACUUM will never visit the page to re-freeze it the way it would have done in earlier releases. However, even in older releases, I think there's a remote possibility of data corruption. Backend #1 makes these changes to the page, releases the lock, and errors out. Backend #2 writes the page to the OS. DBA takes a hot backup, tearing the page in the middle of XMAX. Oops. I'm not sure what to do about this: this part of the heap_update() logic has been like this forever, and I assume that if it were easy to refactor this away, somebody would have done it by now. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
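For readers following along, the sequence being described is approximately this (heavily abridged from heap_update() as of cae1c788; a sketch, not the literal code):

/* old page is exclusively locked; visibility checks already done */
HeapTupleHeaderSetXmax(oldtup.t_data, xmax_old_tuple); /* page modified... */
/* ...along with CMAX, infomask, infomask2 and CTID */
LockBuffer(buffer, BUFFER_LOCK_UNLOCK);     /* unlocked: nothing WAL-logged,
                                             * all-visible still set */
/* TOAST work and target-page selection follow; both can ERROR out */
heaptup = toast_insert_or_update(relation, newtup, &oldtup, 0);
newbuf = RelationGetBufferForTuple(relation, heaptup->t_len,
                                   buffer, 0, NULL,
                                   &vmbuffer_new, &vmbuffer);
LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
/* only now: PageClearAllVisible(), visibilitymap_clear(), XLogInsert() */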
On Wed, Jun 15, 2016 at 6:26 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Wed, Jun 15, 2016 at 2:41 AM, Thomas Munro
> <thomas.munro@enterprisedb.com> wrote:
> > I spent some time chasing down the exact circumstances. I suspect
> > that there may be an interlocking problem in heap_update. Using the
> > line numbers from cae1c788 [1], I see the following interaction
> > between the VACUUM, UPDATE and SELECT (pg_check_visible) backends, all
> > in reference to the same block number:
> >
> > [VACUUM] sets all visible bit
> >
> > [UPDATE] heapam.c:3931 HeapTupleHeaderSetXmax(oldtup.t_data, xmax_old_tuple);
> > [UPDATE] heapam.c:3938 LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
> >
> > [SELECT] LockBuffer(buffer, BUFFER_LOCK_SHARE);
> > [SELECT] observes VM_ALL_VISIBLE as true
> > [SELECT] observes tuple in HEAPTUPLE_DELETE_IN_PROGRESS state
> > [SELECT] barfs
> >
> > [UPDATE] heapam.c:4116 visibilitymap_clear(...)
>
> Yikes: heap_update() sets the tuple's XMAX, CMAX, infomask, infomask2,
> and CTID without logging anything or clearing the all-visible flag and
> then releases the lock on the heap page to go do some more work that
> might even ERROR out.
>
Can't we clear the all-visible flag before releasing the lock? We can use the already_marked logic, as it is currently used in the code, to clear it just once.
On Wed, Jun 15, 2016 at 9:43 AM, Amit Kapila <amit.kapila16@gmail.com> wrote: > On Wed, Jun 15, 2016 at 6:26 PM, Robert Haas <robertmhaas@gmail.com> wrote: >> On Wed, Jun 15, 2016 at 2:41 AM, Thomas Munro >> <thomas.munro@enterprisedb.com> wrote: >> > I spent some time chasing down the exact circumstances. I suspect >> > that there may be an interlocking problem in heap_update. Using the >> > line numbers from cae1c788 [1], I see the following interaction >> > between the VACUUM, UPDATE and SELECT (pg_check_visible) backends, all >> > in reference to the same block number: >> > >> > [VACUUM] sets all visible bit >> > >> > [UPDATE] heapam.c:3931 HeapTupleHeaderSetXmax(oldtup.t_data, >> > xmax_old_tuple); >> > [UPDATE] heapam.c:3938 LockBuffer(buffer, BUFFER_LOCK_UNLOCK); >> > >> > [SELECT] LockBuffer(buffer, BUFFER_LOCK_SHARE); >> > [SELECT] observes VM_ALL_VISIBLE as true >> > [SELECT] observes tuple in HEAPTUPLE_DELETE_IN_PROGRESS state >> > [SELECT] barfs >> > >> > [UPDATE] heapam.c:4116 visibilitymap_clear(...) >> >> Yikes: heap_update() sets the tuple's XMAX, CMAX, infomask, infomask2, >> and CTID without logging anything or clearing the all-visible flag and >> then releases the lock on the heap page to go do some more work that >> might even ERROR out. > > Can't we clear the all-visible flag before releasing the lock? We can use > logic of already_marked as it is currently used in code to clear it just > once. That just kicks the can down the road. Then you have PD_ALL_VISIBLE clear but the VM bit is still set. And you still haven't WAL-logged anything. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Wed, Jun 15, 2016 at 7:13 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Wed, Jun 15, 2016 at 9:43 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> > On Wed, Jun 15, 2016 at 6:26 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> >> On Wed, Jun 15, 2016 at 2:41 AM, Thomas Munro
> >> <thomas.munro@enterprisedb.com> wrote:
> >> > I spent some time chasing down the exact circumstances. I suspect
> >> > that there may be an interlocking problem in heap_update. Using the
> >> > line numbers from cae1c788 [1], I see the following interaction
> >> > between the VACUUM, UPDATE and SELECT (pg_check_visible) backends, all
> >> > in reference to the same block number:
> >> >
> >> > [VACUUM] sets all visible bit
> >> >
> >> > [UPDATE] heapam.c:3931 HeapTupleHeaderSetXmax(oldtup.t_data,
> >> > xmax_old_tuple);
> >> > [UPDATE] heapam.c:3938 LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
> >> >
> >> > [SELECT] LockBuffer(buffer, BUFFER_LOCK_SHARE);
> >> > [SELECT] observes VM_ALL_VISIBLE as true
> >> > [SELECT] observes tuple in HEAPTUPLE_DELETE_IN_PROGRESS state
> >> > [SELECT] barfs
> >> >
> >> > [UPDATE] heapam.c:4116 visibilitymap_clear(...)
> >>
> >> Yikes: heap_update() sets the tuple's XMAX, CMAX, infomask, infomask2,
> >> and CTID without logging anything or clearing the all-visible flag and
> >> then releases the lock on the heap page to go do some more work that
> >> might even ERROR out.
> >
> > Can't we clear the all-visible flag before releasing the lock? We can use
> > logic of already_marked as it is currently used in code to clear it just
> > once.
>
> That just kicks the can down the road. Then you have PD_ALL_VISIBLE
> clear but the VM bit is still set.
I mean to say clear both as we are doing currently in code:
if (PageIsAllVisible(BufferGetPage(buffer)))
{
    all_visible_cleared = true;
    PageClearAllVisible(BufferGetPage(buffer));
    visibilitymap_clear(relation, BufferGetBlockNumber(buffer),
                        vmbuffer);
}
>
> And you still haven't WAL-logged
> anything.
>
Yeah, I think the WAL requirement is more difficult to meet: releasing the lock on the buffer before writing WAL could allow such a buffer to be flushed to disk before its WAL record.
I feel this is an existing bug and should go to the Older Bugs section of the open items page.
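For context, the convention Amit is referring to (per src/backend/access/transam/README) is that a WAL-logged page change keeps the buffer exclusively locked from the modification through PageSetLSN, so the buffer manager cannot write the page ahead of its WAL record. A minimal sketch of the standard pattern:

START_CRIT_SECTION();
/* ... modify the page ... */
MarkBufferDirty(buffer);
if (RelationNeedsWAL(relation))
{
    XLogRecPtr  recptr;

    XLogBeginInsert();
    XLogRegisterBuffer(0, buffer, REGBUF_STANDARD);
    /* ... register the record data ... */
    recptr = XLogInsert(RM_HEAP_ID, XLOG_HEAP_UPDATE);
    PageSetLSN(BufferGetPage(buffer), recptr);
}
END_CRIT_SECTION();
LockBuffer(buffer, BUFFER_LOCK_UNLOCK);     /* safe to unlock only now */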
On Wed, Jun 15, 2016 at 9:56 PM, Robert Haas <robertmhaas@gmail.com> wrote: > On Wed, Jun 15, 2016 at 2:41 AM, Thomas Munro > <thomas.munro@enterprisedb.com> wrote: >> I spent some time chasing down the exact circumstances. I suspect >> that there may be an interlocking problem in heap_update. Using the >> line numbers from cae1c788 [1], I see the following interaction >> between the VACUUM, UPDATE and SELECT (pg_check_visible) backends, all >> in reference to the same block number: >> >> [VACUUM] sets all visible bit >> >> [UPDATE] heapam.c:3931 HeapTupleHeaderSetXmax(oldtup.t_data, xmax_old_tuple); >> [UPDATE] heapam.c:3938 LockBuffer(buffer, BUFFER_LOCK_UNLOCK); >> >> [SELECT] LockBuffer(buffer, BUFFER_LOCK_SHARE); >> [SELECT] observes VM_ALL_VISIBLE as true >> [SELECT] observes tuple in HEAPTUPLE_DELETE_IN_PROGRESS state >> [SELECT] barfs >> >> [UPDATE] heapam.c:4116 visibilitymap_clear(...) > > Yikes: heap_update() sets the tuple's XMAX, CMAX, infomask, infomask2, > and CTID without logging anything or clearing the all-visible flag and > then releases the lock on the heap page to go do some more work that > might even ERROR out. Only if that other work all goes OK do we > relock the page and perform the WAL-logged actions. > > That doesn't seem like a good idea even in existing releases, because > you've taken a tuple on an all-visible page and made it not > all-visible, and you've made a page modification that is not > necessarily atomic without logging it. This is is particularly bad in > 9.6, because if that page is also all-frozen then XMAX will eventually > be pointing into space and VACUUM will never visit the page to > re-freeze it the way it would have done in earlier releases. However, > even in older releases, I think there's a remote possibility of data > corruption. Backend #1 makes these changes to the page, releases the > lock, and errors out. Backend #2 writes the page to the OS. DBA > takes a hot backup, tearing the page in the middle of XMAX. Oops. > > I'm not sure what to do about this: this part of the heap_update() > logic has been like this forever, and I assume that if it were easy to > refactor this away, somebody would have done it by now. > How about changing collect_corrupt_items to acquire AccessExclusiveLock for safely checking? Regards, -- Masahiko Sawada
On Wed, Jun 15, 2016 at 10:03 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote: >> I'm not sure what to do about this: this part of the heap_update() >> logic has been like this forever, and I assume that if it were easy to >> refactor this away, somebody would have done it by now. > > How about changing collect_corrupt_items to acquire > AccessExclusiveLock for safely checking? Well, that would make it a lot less likely for pg_check_{visible,frozen} to detect the bug in heap_update(), but it wouldn't fix the bug in heap_update(). -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Wed, Jun 15, 2016 at 9:59 AM, Amit Kapila <amit.kapila16@gmail.com> wrote: >> That just kicks the can down the road. Then you have PD_ALL_VISIBLE >> clear but the VM bit is still set. > > I mean to say clear both as we are doing currently in code: > if (PageIsAllVisible(BufferGetPage(buffer))) > { > all_visible_cleared = true; > PageClearAllVisible(BufferGetPage(buffer)); > visibilitymap_clear(relation, BufferGetBlockNumber(buffer), > vmbuffer); > } Sure, but without emitting a WAL record, that's just broken. You could have the heap page get flushed to disk and the VM page not get flushed to disk, and then crash, and now you have the classic VM corruption scenario. >> And you still haven't WAL-logged >> anything. > > Yeah, I think WAL requirement is more difficult to meet and I think > releasing the lock on buffer before writing WAL could lead to flush of such > a buffer before WAL. > > I feel this is an existing-bug and should go to Older Bugs Section in open > items page. It does seem to be an existing bug, but the freeze map makes the problem more serious, I think. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Wed, Jun 15, 2016 at 08:56:52AM -0400, Robert Haas wrote: > On Wed, Jun 15, 2016 at 2:41 AM, Thomas Munro > <thomas.munro@enterprisedb.com> wrote: > > I spent some time chasing down the exact circumstances. I suspect > > that there may be an interlocking problem in heap_update. Using the > > line numbers from cae1c788 [1], I see the following interaction > > between the VACUUM, UPDATE and SELECT (pg_check_visible) backends, all > > in reference to the same block number: > > > > [VACUUM] sets all visible bit > > > > [UPDATE] heapam.c:3931 HeapTupleHeaderSetXmax(oldtup.t_data, xmax_old_tuple); > > [UPDATE] heapam.c:3938 LockBuffer(buffer, BUFFER_LOCK_UNLOCK); > > > > [SELECT] LockBuffer(buffer, BUFFER_LOCK_SHARE); > > [SELECT] observes VM_ALL_VISIBLE as true > > [SELECT] observes tuple in HEAPTUPLE_DELETE_IN_PROGRESS state > > [SELECT] barfs > > > > [UPDATE] heapam.c:4116 visibilitymap_clear(...) > > Yikes: heap_update() sets the tuple's XMAX, CMAX, infomask, infomask2, > and CTID without logging anything or clearing the all-visible flag and > then releases the lock on the heap page to go do some more work that > might even ERROR out. Only if that other work all goes OK do we > relock the page and perform the WAL-logged actions. > > That doesn't seem like a good idea even in existing releases, because > you've taken a tuple on an all-visible page and made it not > all-visible, and you've made a page modification that is not > necessarily atomic without logging it. This is particularly bad in > 9.6, because if that page is also all-frozen then XMAX will eventually > be pointing into space and VACUUM will never visit the page to > re-freeze it the way it would have done in earlier releases. However, > even in older releases, I think there's a remote possibility of data > corruption. Backend #1 makes these changes to the page, releases the > lock, and errors out. Backend #2 writes the page to the OS. DBA > takes a hot backup, tearing the page in the middle of XMAX. Oops. I agree the non-atomic, unlogged change is a problem. A related threat doesn't require a torn page:

AssignTransactionId() xid=123
heapam.c:3931 HeapTupleHeaderSetXmax(oldtup.t_data, 123);
some ERROR before heap_update() finishes
rollback; -- xid=123
some backend flushes the modified page
immediate shutdown
AssignTransactionId() xid=123
commit; -- xid=123

If nothing wrote an xlog record that witnesses xid 123, the cluster can reassign it after recovery. The failed update is now considered a successful update, and the row improperly becomes dead. That's important. I don't know whether the 9.6 all-frozen mechanism materially amplifies the consequences of this bug. The interaction with visibility map and freeze map is not all bad; indeed, it can reduce the risk of experiencing consequences from the non-atomic, unlogged change bug. If the row is all-visible when heap_update() starts, every transaction should continue to consider the row visible until heap_update() finishes successfully. If an ERROR interrupts heap_update(), visibility verdicts should be as though the heap_update() never happened. If one of the previously-described mechanisms would make an xmax visibility test give the wrong answer, an all-visible bit could mask the problem for awhile. Having said that, freeze map hurts in scenarios involving toast_insert_or_update() failures and no crash recovery. Instead of VACUUM cleaning up the aborted xmax, that xmax could persist long enough for its xid to be reused in a successful transaction.
When some other modification finally clears all-frozen and all-visible, the row improperly becomes dead. Both scenarios are fairly rare; I don't know which is more rare. [Disclaimer: I have not built test cases to verify those alleged failure mechanisms.] If we made this pre-9.6 bug a 9.6 open item, would anyone volunteer to own it? Then we wouldn't need to guess whether 9.6 will be safer with the freeze map or safer without the freeze map. Thanks, nm
On 2016-06-15 08:56:52 -0400, Robert Haas wrote: > Yikes: heap_update() sets the tuple's XMAX, CMAX, infomask, infomask2, > and CTID without logging anything or clearing the all-visible flag and > then releases the lock on the heap page to go do some more work that > might even ERROR out. Only if that other work all goes OK do we > relock the page and perform the WAL-logged actions. > > That doesn't seem like a good idea even in existing releases, because > you've taken a tuple on an all-visible page and made it not > all-visible, and you've made a page modification that is not > necessarily atomic without logging it. Right, that's broken. > I'm not sure what to do about this: this part of the heap_update() > logic has been like this forever, and I assume that if it were easy to > refactor this away, somebody would have done it by now. Well, I think generally nobody seriously looked at actually refactoring heap_update(), even though that'd be a good idea. But in this instance, the problem seems relatively fundamental: We need to lock the origin page, to do visibility checks, etc. Then we need to figure out the target page. Even disregarding toasting - which we could be doing earlier with some refactoring - we're going to have to release the page level lock, to lock them in ascending order. Otherwise we'll risk kinda likely deadlocks. We also certainly don't want to nest the lwlocks for the toast stuff, inside a content lock for the old tuple's page. So far the best idea I have - and it's really not a good one - is to invent a new hint-bit that tells concurrent updates to acquire a heavyweight tuple lock, while releasing the page-level lock. If that hint bit does not require any other modifications - and we don't need an xid in xmax for this use case - that'll avoid doing all the other `already_marked` stuff early, which should address the correctness issue. It's kinda invasive though, and probably has performance implications. Does anybody have a better idea? Regards, Andres
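A very rough sketch of that idea (purely hypothetical: the flag name and bit value are invented, assuming a spare t_infomask2 bit, and nothing like this exists today; heap_acquire_tuplock is the existing heapam.c helper):

/* hypothetical flag -- assumes a spare t_infomask2 bit can be found */
#define HEAP_UPDATE_PENDING     0x0800

/* in heap_update(), before releasing the content lock on the old page: */
oldtup.t_data->t_infomask2 |= HEAP_UPDATE_PENDING;
LockBuffer(buffer, BUFFER_LOCK_UNLOCK);

/* a concurrent updater that sees the bit falls back to the heavyweight
 * tuple lock instead of trusting the page contents: */
if (oldtup.t_data->t_infomask2 & HEAP_UPDATE_PENDING)
    heap_acquire_tuplock(relation, &(oldtup.t_self), LockTupleExclusive,
                         LockWaitBlock, &have_tuple_lock);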
On Mon, Jun 20, 2016 at 3:33 PM, Andres Freund <andres@anarazel.de> wrote: >> I'm not sure what to do about this: this part of the heap_update() >> logic has been like this forever, and I assume that if it were easy to >> refactor this away, somebody would have done it by now. > > Well, I think generally nobody seriously looked at actually refactoring > heap_update(), even though that'd be a good idea. But in this instance, > the problem seems relatively fundamental: > > We need to lock the origin page, to do visibility checks, etc. Then we > need to figure out the target page. Even disregarding toasting - which > we could be doing earlier with some refactoring - we're going to have to > release the page level lock, to lock them in ascending order. Otherwise > we'll risk kinda likely deadlocks. We also certainly don't want to nest > the lwlocks for the toast stuff, inside a content lock for the old > tupe's page. > > So far the best idea I have - and it's really not a good one - is to > invent a new hint-bit that tells concurrent updates to acquire a > heavyweight tuple lock, while releasing the page-level lock. If that > hint bit does not require any other modifications - and we don't need an > xid in xmax for this use case - that'll avoid doing all the other > `already_marked` stuff early, which should address the correctness > issue. It's kinda invasive though, and probably has performance > implications. > > Does anybody have a better idea? What exactly is the point of all of that already_marked stuff? I mean, suppose we just don't do any of that before we go off to do toast_insert_or_update and RelationGetBufferForTuple. Eventually, when we reacquire the page lock, we might find that somebody else has already updated the tuple, but couldn't that be handled by (approximately) looping back up to l2 just as we do in several other cases? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 2016-06-20 16:10:23 -0400, Robert Haas wrote: > What exactly is the point of all of that already_marked stuff? Preventing the old tuple from being locked/updated by another backend, while unlocking the buffer. > I > mean, suppose we just don't do any of that before we go off to do > toast_insert_or_update and RelationGetBufferForTuple. Eventually, > when we reacquire the page lock, we might find that somebody else has > already updated the tuple, but couldn't that be handled by > (approximately) looping back up to l2 just as we do in several other > cases? We'd potentially have to undo a fair amount more work: the toasted data would have to be deleted and such, just to retry. Which isn't going to be super easy, because all of it will be happening with the current cid (we can't just increase CommandCounterIncrement() for correctness reasons). Andres
On Mon, Jun 20, 2016 at 4:24 PM, Andres Freund <andres@anarazel.de> wrote: > On 2016-06-20 16:10:23 -0400, Robert Haas wrote: >> What exactly is the point of all of that already_marked stuff? > > Preventing the old tuple from being locked/updated by another backend, > while unlocking the buffer. > >> I >> mean, suppose we just don't do any of that before we go off to do >> toast_insert_or_update and RelationGetBufferForTuple. Eventually, >> when we reacquire the page lock, we might find that somebody else has >> already updated the tuple, but couldn't that be handled by >> (approximately) looping back up to l2 just as we do in several other >> cases? > > We'd potentially have to undo a fair amount more work: the toasted data > would have to be deleted and such, just to retry. Which isn't going to > super easy, because all of it will be happening with the current cid (we > can't just increase CommandCounterIncrement() for correctness reasons). Why would we have to delete the TOAST data? AFAIUI, the tuple points to the TOAST data, but not the other way around. So if we change our mind about where to put the tuple, I don't think that requires re-TOASTing. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
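For reference, the TOAST pointer stored in the heap tuple looks like this (varatt_external, as defined in postgres.h of this era); it carries only the forward reference into the TOAST table, with no back-pointer, which is why relocating the heap tuple itself would not require re-TOASTing:

typedef struct varatt_external
{
    int32       va_rawsize;     /* original data size (includes header) */
    int32       va_extsize;     /* external saved size (without header) */
    Oid         va_valueid;     /* unique ID of value within TOAST table */
    Oid         va_toastrelid;  /* RelationID of TOAST table containing it */
} varatt_external;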
On 2016-06-20 17:55:19 -0400, Robert Haas wrote: > On Mon, Jun 20, 2016 at 4:24 PM, Andres Freund <andres@anarazel.de> wrote: > > On 2016-06-20 16:10:23 -0400, Robert Haas wrote: > >> What exactly is the point of all of that already_marked stuff? > > > > Preventing the old tuple from being locked/updated by another backend, > > while unlocking the buffer. > > > >> I > >> mean, suppose we just don't do any of that before we go off to do > >> toast_insert_or_update and RelationGetBufferForTuple. Eventually, > >> when we reacquire the page lock, we might find that somebody else has > >> already updated the tuple, but couldn't that be handled by > >> (approximately) looping back up to l2 just as we do in several other > >> cases? > > > > We'd potentially have to undo a fair amount more work: the toasted data > > would have to be deleted and such, just to retry. Which isn't going to > > super easy, because all of it will be happening with the current cid (we > > can't just increase CommandCounterIncrement() for correctness reasons). > > Why would we have to delete the TOAST data? AFAIUI, the tuple points > to the TOAST data, but not the other way around. So if we change our > mind about where to put the tuple, I don't think that requires > re-TOASTing. Consider what happens if we, after restarting at l2, notice that we can't actually insert, but return in the !HeapTupleMayBeUpdated branch. If the caller doesn't error out - and there's certainly callers doing that - we'd "leak" a toasted datum. Unless the transaction aborts, the toasted datum would never be cleaned up, because there's no datum pointing to it, so no heap_delete will ever recurse into the toast datum (via toast_delete()). Andres
On Fri, Jun 17, 2016 at 3:36 PM, Noah Misch <noah@leadboat.com> wrote: > I agree the non-atomic, unlogged change is a problem. A related threat > doesn't require a torn page: > > AssignTransactionId() xid=123 > heapam.c:3931 HeapTupleHeaderSetXmax(oldtup.t_data, 123); > some ERROR before heap_update() finishes > rollback; -- xid=123 > some backend flushes the modified page > immediate shutdown > AssignTransactionId() xid=123 > commit; -- xid=123 > > If nothing wrote an xlog record that witnesses xid 123, the cluster can > reassign it after recovery. The failed update is now considered a successful > update, and the row improperly becomes dead. That's important. I wonder if that was originally supposed to be handled with the HEAP_XMAX_UNLOGGED flag which was removed in 11919160. A comment in the heap WAL logging commit f2bfe8a2 said that tqual routines would see the HEAP_XMAX_UNLOGGED flag in the event of a crash before logging (though I'm not sure if the tqual routines ever actually did that). -- Thomas Munro http://www.enterprisedb.com
On Tue, Jun 21, 2016 at 1:03 AM, Andres Freund <andres@anarazel.de> wrote:
>
> On 2016-06-15 08:56:52 -0400, Robert Haas wrote:
> > Yikes: heap_update() sets the tuple's XMAX, CMAX, infomask, infomask2,
> > and CTID without logging anything or clearing the all-visible flag and
> > then releases the lock on the heap page to go do some more work that
> > might even ERROR out. Only if that other work all goes OK do we
> > relock the page and perform the WAL-logged actions.
> >
> > That doesn't seem like a good idea even in existing releases, because
> > you've taken a tuple on an all-visible page and made it not
> > all-visible, and you've made a page modification that is not
> > necessarily atomic without logging it.
>
> Right, that's broken.
>
>
> > I'm not sure what to do about this: this part of the heap_update()
> > logic has been like this forever, and I assume that if it were easy to
> > refactor this away, somebody would have done it by now.
>
> Well, I think generally nobody seriously looked at actually refactoring
> heap_update(), even though that'd be a good idea. But in this instance,
> the problem seems relatively fundamental:
>
> We need to lock the origin page, to do visibility checks, etc. Then we
> need to figure out the target page. Even disregarding toasting - which
> we could be doing earlier with some refactoring - we're going to have to
> release the page level lock, to lock them in ascending order. Otherwise
> we'll risk kinda likely deadlocks.
>
Can we consider using some strategy to avoid deadlocks without releasing the lock on the old page? Consider if we could have a mechanism such that RelationGetBufferForTuple() will ensure that it always returns a new buffer whose targetblock is greater than the old block (on which we already hold a lock). I think the tricky part here is whether we can get anything like that from the FSM. Also, there could be cases where we need to extend the heap even though there were pages in the heap with space available, because we ignored them since their block number is smaller than the block number on which we have the lock.
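A sketch of what that could look like (entirely hypothetical: the FSM has no "free space above block N" entry point today, so fsm_search_above() below is invented purely for illustration):

/* hypothetical helper built on an invented FSM call */
static BlockNumber
get_target_block_above(Relation relation, BlockNumber oldBlock, Size needed)
{
    BlockNumber targetBlock;

    /* invented API: only consider blocks numbered above oldBlock */
    targetBlock = fsm_search_above(relation, needed, oldBlock);
    if (targetBlock == InvalidBlockNumber)
    {
        /*
         * No higher-numbered page has room: extend the relation, possibly
         * wasting free space in lower-numbered pages -- the space-usage
         * drawback noted above.
         */
        targetBlock = RelationGetNumberOfBlocks(relation);
    }
    return targetBlock;
}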
> We also certainly don't want to nest
> the lwlocks for the toast stuff, inside a content lock for the old
> tupe's page.
>
> So far the best idea I have - and it's really not a good one - is to
> invent a new hint-bit that tells concurrent updates to acquire a
> heavyweight tuple lock, while releasing the page-level lock. If that
> hint bit does not require any other modifications - and we don't need an
> xid in xmax for this use case - that'll avoid doing all the other
> `already_marked` stuff early, which should address the correctness
> issue.
>
Don't we need to clear such a flag in case of error? Also, don't we need to reset it later, when modifying the old page again before WAL-logging?
> It's kinda invasive though, and probably has performance
> implications.
>
Do you see a performance implication due to the requirement of a heavyweight tuple lock in more cases than now, or something else?
Some other ways could be:
Before releasing the lock on the buffer containing the old tuple, clear the VM and page-level visibility info and WAL-log it. I think this could impact performance depending on how frequently we need to perform this action.
Have a new flag like HEAP_XMAX_UNLOGGED (as existed when this logic was introduced in commit f2bfe8a24c46133f81e188653a127f939eb33c4a), set it in the old tuple header before releasing the lock on the buffer, and teach tqual.c to honor the flag. I think tqual.c should consider HEAP_XMAX_UNLOGGED as an indication of an aborted transaction unless that transaction is currently in progress. Also, I think we need to clear this flag before WAL logging in heap_update.
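A rough sketch of how tqual.c might honor such a flag (hypothetical: HEAP_XMAX_UNLOGGED no longer exists and the flag shown is notional; SetHintBits is the existing tqual.c helper):

/* in a tqual.c visibility routine, when examining xmax: */
if (tuple->t_infomask & HEAP_XMAX_UNLOGGED)     /* notional flag */
{
    if (TransactionIdIsInProgress(HeapTupleHeaderGetRawXmax(tuple)))
        return true;            /* deleter still running; tuple visible */

    /* the xmax was never WAL-logged: treat the deleter as aborted */
    SetHintBits(tuple, buffer, HEAP_XMAX_INVALID, InvalidTransactionId);
    return true;                /* tuple remains visible */
}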
On Tue, Jun 21, 2016 at 3:29 PM, Amit Kapila <amit.kapila16@gmail.com> wrote: > On Tue, Jun 21, 2016 at 1:03 AM, Andres Freund <andres@anarazel.de> wrote: >> Well, I think generally nobody seriously looked at actually refactoring >> heap_update(), even though that'd be a good idea. But in this instance, >> the problem seems relatively fundamental: >> >> We need to lock the origin page, to do visibility checks, etc. Then we >> need to figure out the target page. Even disregarding toasting - which >> we could be doing earlier with some refactoring - we're going to have to >> release the page level lock, to lock them in ascending order. Otherwise >> we'll risk kinda likely deadlocks. > > Can we consider to use some strategy to avoid deadlocks without releasing > the lock on old page? Consider if we could have a mechanism such that > RelationGetBufferForTuple() will ensure that it always returns a new buffer > which has targetblock greater than the old block (on which we already held a > lock). I think here tricky part is whether we can get anything like that > from FSM. Also, there could be cases where we need to extend the heap when > there were pages in heap with space available, but we have ignored them > because there block number is smaller than the block number on which we have > lock. Doesn't that mean that over time, given a workload that does only or mostly updates, your records tend to migrate further and further away from the start of the file, leaving a growing unusable space at the beginning, until you eventually need to CLUSTER/VACUUM FULL? I was wondering about speculatively asking for a free page with a lower block number than the origin page, if one is available, before locking the origin page. Then after locking the origin page, if it turns out you need a page but didn't get it earlier, asking for a free page with a higher block number than the origin page. -- Thomas Munro http://www.enterprisedb.com
On Tue, Jun 21, 2016 at 3:29 PM, Amit Kapila <amit.kapila16@gmail.com> wrote: > Some others ways could be: > > Before releasing the lock on buffer containing old tuple, clear the VM and > visibility info from page and WAL log it. I think this could impact > performance depending on how frequently we need to perform this action. > > Have a new flag like HEAP_XMAX_UNLOGGED (as it was there when this logic was > introduced in commit f2bfe8a24c46133f81e188653a127f939eb33c4a ) and set the > same in old tuple header before releasing lock on buffer and teach tqual.c > to honor the flag. I think tqual.c should consider HEAP_XMAX_UNLOGGED as > an indication of aborted transaction unless it is currently in-progress. > Also, I think we need to clear this flag before WAL logging in heap_update. I also noticed that and wondered whether it was a mistake to take that out. It appears to have been removed as part of the logic to clear away UNDO log support in 11919160, but it may have been an important part of the heap_update protocol. Though (as I mentioned nearby in a reply to Noah) I'm not sure if the tqual.c side which would ignore the unlogged xmax in the event of a badly timed crash was ever implemented. -- Thomas Munro http://www.enterprisedb.com
On 2016-06-21 08:59:13 +0530, Amit Kapila wrote: > Can we consider to use some strategy to avoid deadlocks without releasing > the lock on old page? Consider if we could have a mechanism such that > RelationGetBufferForTuple() will ensure that it always returns a new buffer > which has targetblock greater than the old block (on which we already held > a lock). I think here tricky part is whether we can get anything like that > from FSM. Also, there could be cases where we need to extend the heap when > there were pages in heap with space available, but we have ignored them > because there block number is smaller than the block number on which we > have lock. I can't see that being acceptable, from a space-usage POV. > > So far the best idea I have - and it's really not a good one - is to > invent a new hint-bit that tells concurrent updates to acquire a > heavyweight tuple lock, while releasing the page-level lock. If that > hint bit does not require any other modifications - and we don't need an > xid in xmax for this use case - that'll avoid doing all the other > `already_marked` stuff early, which should address the correctness > issue. > > Don't we need to clear such a flag in case of error? Also don't we need to > reset it later, like when modifying the old page later before WAL. If the flag just says "acquire a heavyweight lock", then there's no need for that. That's cheap enough to just do if it's erroneously set. At least I can't see any reason. > > It's kinda invasive though, and probably has performance > implications. > > Do you see performance implication due to requirement of heavywieht tuple > lock in more cases than now or something else? Because of that, yes. > Some others ways could be: > > Before releasing the lock on buffer containing old tuple, clear the VM and > visibility info from page and WAL log it. I think this could impact > performance depending on how frequently we need to perform this action. Doubling the number of xlog inserts in heap_update would certainly be measurable :(. My guess is that the heavyweight tuple lock approach will be less expensive. Andres
On Tue, Jun 21, 2016 at 9:08 AM, Thomas Munro <thomas.munro@enterprisedb.com> wrote:
>
> On Tue, Jun 21, 2016 at 3:29 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> > On Tue, Jun 21, 2016 at 1:03 AM, Andres Freund <andres@anarazel.de> wrote:
> >> Well, I think generally nobody seriously looked at actually refactoring
> >> heap_update(), even though that'd be a good idea. But in this instance,
> >> the problem seems relatively fundamental:
> >>
> >> We need to lock the origin page, to do visibility checks, etc. Then we
> >> need to figure out the target page. Even disregarding toasting - which
> >> we could be doing earlier with some refactoring - we're going to have to
> >> release the page level lock, to lock them in ascending order. Otherwise
> >> we'll risk kinda likely deadlocks.
> >
> > Can we consider to use some strategy to avoid deadlocks without releasing
> > the lock on old page? Consider if we could have a mechanism such that
> > RelationGetBufferForTuple() will ensure that it always returns a new buffer
> > which has targetblock greater than the old block (on which we already held a
> > lock). I think here tricky part is whether we can get anything like that
> > from FSM. Also, there could be cases where we need to extend the heap when
> > there were pages in heap with space available, but we have ignored them
> > because there block number is smaller than the block number on which we have
> > lock.
>
> Doesn't that mean that over time, given a workload that does only or
> mostly updates, your records tend to migrate further and further away
> from the start of the file, leaving a growing unusable space at the
> beginning, until you eventually need to CLUSTER/VACUUM FULL?
>
For update-mostly loads, the updated tuple should ideally fit in the same page as the old tuple in many cases, if fillfactor is properly configured. Why would the records always migrate further away? They should get the space freed by other updates in intermediate pages. I think there could be some impact space-wise, but the freed-up space will eventually be used.
> I was wondering about speculatively asking for a free page with a
> lower block number than the origin page, if one is available, before
> locking the origin page.
Do you want to lock it as well? In any case, I think adding that code without deciding whether the update can be accommodated in the current page could prove to be costly.
> Then after locking the origin page, if it
> turns out you need a page but didn't get it earlier, asking for a free
> page with a higher block number than the origin page.
>
Something like that might work out if it is feasible and people agree on pursuing such an approach.
On Tue, Jun 21, 2016 at 9:16 AM, Thomas Munro <thomas.munro@enterprisedb.com> wrote:
>
> On Tue, Jun 21, 2016 at 3:29 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> > Some others ways could be:
> >
> > Before releasing the lock on buffer containing old tuple, clear the VM and
> > visibility info from page and WAL log it. I think this could impact
> > performance depending on how frequently we need to perform this action.
> >
> > Have a new flag like HEAP_XMAX_UNLOGGED (as it was there when this logic was
> > introduced in commit f2bfe8a24c46133f81e188653a127f939eb33c4a ) and set the
> > same in old tuple header before releasing lock on buffer and teach tqual.c
> > to honor the flag. I think tqual.c should consider HEAP_XMAX_UNLOGGED as
> > an indication of aborted transaction unless it is currently in-progress.
> > Also, I think we need to clear this flag before WAL logging in heap_update.
>
> I also noticed that and wondered whether it was a mistake to take that
> out. It appears to have been removed as part of the logic to clear
> away UNDO log support in 11919160, but it may have been an important
> part of the heap_update protocol. Though (as I mentioned nearby in a
> reply to Noah) I'm not sure if the tqual.c side which would ignore
> the unlogged xmax in the event of a badly timed crash was ever
> implemented.
>
Right, my observation is similar to yours, and that's what I am suggesting as one alternative to fix this issue. I think making this approach work (even if it doesn't have any problems) might turn out to be tricky. However, the plus point of this approach seems to be that it shouldn't impact performance in most cases.
On Tue, Jun 21, 2016 at 9:21 AM, Andres Freund <andres@anarazel.de> wrote:
> On 2016-06-21 08:59:13 +0530, Amit Kapila wrote:
> > Can we consider to use some strategy to avoid deadlocks without releasing
> > the lock on old page? Consider if we could have a mechanism such that
> > RelationGetBufferForTuple() will ensure that it always returns a new buffer
> > which has targetblock greater than the old block (on which we already held
> > a lock). I think here tricky part is whether we can get anything like that
> > from FSM. Also, there could be cases where we need to extend the heap when
> > there were pages in heap with space available, but we have ignored them
> > because their block number is smaller than the block number on which we
> > have lock.
>
> I can't see that being acceptable, from a space-usage POV.
>
> > > So far the best idea I have - and it's really not a good one - is to
> > > invent a new hint-bit that tells concurrent updates to acquire a
> > > heavyweight tuple lock, while releasing the page-level lock. If that
> > > hint bit does not require any other modifications - and we don't need an
> > > xid in xmax for this use case - that'll avoid doing all the other
> > > `already_marked` stuff early, which should address the correctness
> > > issue.
> > >
> >
> > Don't we need to clear such a flag in case of error? Also don't we need to
> > reset it later, like when modifying the old page later before WAL.
>
> If the flag just says "acquire a heavyweight lock", then there's no need
> for that. That's cheap enough to just do if it's erroneously set. At
> least I can't see any reason.
>
I think it will just increase the chances of other backends acquiring a heavyweight lock.
> > > It's kinda invasive though, and probably has performance
> > > implications.
> > >
> >
> > Do you see a performance implication due to the requirement of heavyweight tuple
> > lock in more cases than now or something else?
>
> Because of that, yes.
>
>
> > Some others ways could be:
> >
> > Before releasing the lock on buffer containing old tuple, clear the VM and
> > visibility info from page and WAL log it. I think this could impact
> > performance depending on how frequently we need to perform this action.
>
> Doubling the number of xlog inserts in heap_update would certainly be
> measurable :(. My guess is that the heavyweight tuple lock approach will
> be less expensive.
>
Probably, but I think the heavyweight tuple lock is more invasive. Increasing the number of xlog inserts could surely impact performance, depending upon how frequently we need to do it. I think we might want to combine it with the idea of having RelationGetBufferForTuple() return a higher block number, such that if we don't find a higher block number from the FSM, we release the lock on the old page and try to get the locks on the old and new buffers as we do now. This will further reduce the chances of extra xlog insert calls and address the issue of space wastage.
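[To make the combined idea concrete, here is a minimal standalone C sketch of the block-ordering rule being proposed. Everything here is invented for illustration (the function name, the candidate array standing in for an FSM lookup); it is not PostgreSQL code.]

/*
 * Sketch: only accept a target block numbered above the block we
 * already hold locked, so the "lock lower block first" ordering is
 * never violated; otherwise signal the caller to fall back to the
 * release-and-relock path.
 */
#include <stdint.h>
#include <stdio.h>

typedef uint32_t BlockNumber;
#define InvalidBlockNumber ((BlockNumber) 0xFFFFFFFF)

static BlockNumber
fsm_search_above(const BlockNumber *candidates, int n, BlockNumber min_block)
{
    for (int i = 0; i < n; i++)
        if (candidates[i] > min_block)
            return candidates[i];   /* safe: old block < new block */
    return InvalidBlockNumber;      /* caller must release-and-relock */
}

int main(void)
{
    BlockNumber free_blocks[] = {3, 7, 42};  /* stand-in for FSM contents */
    BlockNumber locked_block = 10;           /* block we hold locked */
    BlockNumber target = fsm_search_above(free_blocks, 3, locked_block);

    if (target == InvalidBlockNumber)
        printf("no higher block with space: release old lock, then lock both in order\n");
    else
        printf("use block %u; lock ordering is preserved\n", (unsigned) target);
    return 0;
}

[As the thread notes, the cost of this scheme is space wastage: lower-numbered blocks with free space get skipped, which is why it is proposed only as a fallback-reducing optimization.]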
On Mon, Jun 20, 2016 at 5:59 PM, Andres Freund <andres@anarazel.de> wrote:
> On 2016-06-20 17:55:19 -0400, Robert Haas wrote:
>> On Mon, Jun 20, 2016 at 4:24 PM, Andres Freund <andres@anarazel.de> wrote:
>> > On 2016-06-20 16:10:23 -0400, Robert Haas wrote:
>> >> I mean, suppose we just don't do any of that before we go off to do
>> >> toast_insert_or_update and RelationGetBufferForTuple. Eventually,
>> >> when we reacquire the page lock, we might find that somebody else has
>> >> already updated the tuple, but couldn't that be handled by
>> >> (approximately) looping back up to l2 just as we do in several other
>> >> cases?
>> >
>> > We'd potentially have to undo a fair amount more work: the toasted data
>> > would have to be deleted and such, just to retry. Which isn't going to
>> > be super easy, because all of it will be happening with the current cid
>> > (we can't just increase CommandCounterIncrement() for correctness
>> > reasons).
>>
>> Why would we have to delete the TOAST data? AFAIUI, the tuple points
>> to the TOAST data, but not the other way around. So if we change our
>> mind about where to put the tuple, I don't think that requires
>> re-TOASTing.
>
> Consider what happens if we, after restarting at l2, notice that we
> can't actually insert, but return in the !HeapTupleMayBeUpdated
> branch. If the caller doesn't error out - and there are certainly
> callers doing that - we'd "leak" a toasted datum. Unless the
> transaction aborts, the toasted datum would never be cleaned up,
> because there's no datum pointing to it, so no heap_delete will ever
> recurse into the toast datum (via toast_delete()).

OK, I see what you mean. Still, that doesn't seem like such a terrible
cost. If you try to update a tuple, and it looks like you can update it,
but then after TOASTing you find that the status of the tuple has
changed such that you can't update it after all, then you might need to
go set xmax = MyTxid() on all of the TOAST tuples you created (whose
CTIDs we could save someplace, so that it's just a matter of finding
them by CTID to kill them). That's not likely to happen particularly
often, though, and when it does happen it's not insanely expensive. We
could also reduce the cost by letting the caller of heap_update() decide
whether to back out the work; if the caller intends to throw an error
anyway, then there's no point.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes:
> On Mon, Jun 20, 2016 at 5:59 PM, Andres Freund <andres@anarazel.de> wrote:
>> Consider what happens if we, after restarting at l2, notice that we
>> can't actually insert, but return in the !HeapTupleMayBeUpdated
>> branch.

> OK, I see what you mean. Still, that doesn't seem like such a
> terrible cost. If you try to update a tuple, and it looks like you
> can update it, but then after TOASTing you find that the status of the
> tuple has changed such that you can't update it after all, then you
> might need to go set xmax = MyTxid() on all of the TOAST tuples you
> created (whose CTIDs we could save someplace, so that it's just a
> matter of finding them by CTID to kill them).

... and if you get an error or crash partway through that, what happens?

regards, tom lane
On Mon, Jun 20, 2016 at 11:51 PM, Andres Freund <andres@anarazel.de> wrote:
>> > So far the best idea I have - and it's really not a good one - is to
>> > invent a new hint-bit that tells concurrent updates to acquire a
>> > heavyweight tuple lock, while releasing the page-level lock. If that
>> > hint bit does not require any other modifications - and we don't need an
>> > xid in xmax for this use case - that'll avoid doing all the other
>> > `already_marked` stuff early, which should address the correctness
>> > issue.
>>
>> Don't we need to clear such a flag in case of error? Also don't we need to
>> reset it later, like when modifying the old page later before WAL.
>
> If the flag just says "acquire a heavyweight lock", then there's no need
> for that. That's cheap enough to just do if it's erroneously set. At
> least I can't see any reason.

I don't quite understand the intended semantics of this proposed flag.
If we don't already have the tuple lock at that point, we can't go
acquire it before releasing the content lock, can we?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Tue, Jun 21, 2016 at 10:47 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> On Mon, Jun 20, 2016 at 5:59 PM, Andres Freund <andres@anarazel.de> wrote:
>>> Consider what happens if we, after restarting at l2, notice that we
>>> can't actually insert, but return in the !HeapTupleMayBeUpdated
>>> branch.
>
>> OK, I see what you mean. Still, that doesn't seem like such a
>> terrible cost. If you try to update a tuple, and it looks like you
>> can update it, but then after TOASTing you find that the status of the
>> tuple has changed such that you can't update it after all, then you
>> might need to go set xmax = MyTxid() on all of the TOAST tuples you
>> created (whose CTIDs we could save someplace, so that it's just a
>> matter of finding them by CTID to kill them).
>
> ... and if you get an error or crash partway through that, what happens?

Then the transaction is aborted anyway, and we haven't leaked anything
because VACUUM will clean it up.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 2016-06-21 10:50:36 -0400, Robert Haas wrote:
> On Mon, Jun 20, 2016 at 11:51 PM, Andres Freund <andres@anarazel.de> wrote:
> >> > So far the best idea I have - and it's really not a good one - is to
> >> > invent a new hint-bit that tells concurrent updates to acquire a
> >> > heavyweight tuple lock, while releasing the page-level lock. If that
> >> > hint bit does not require any other modifications - and we don't need an
> >> > xid in xmax for this use case - that'll avoid doing all the other
> >> > `already_marked` stuff early, which should address the correctness
> >> > issue.
> >>
> >> Don't we need to clear such a flag in case of error? Also don't we need to
> >> reset it later, like when modifying the old page later before WAL.
> >
> > If the flag just says "acquire a heavyweight lock", then there's no need
> > for that. That's cheap enough to just do if it's erroneously set. At
> > least I can't see any reason.
>
> I don't quite understand the intended semantics of this proposed flag.

Whenever the flag is set, we have to acquire the heavyweight tuple lock
before continuing. That guarantees nobody else can modify the tuple
while the lock is released, without requiring more than one hint bit to
be modified. That should fix the torn page issue, no?

> If we don't already have the tuple lock at that point, we can't go
> acquire it before releasing the content lock, can we?

Why not? Afaics the way that tuple locks are used, the nesting should
be fine.

Andres
On Tue, Jun 21, 2016 at 12:54 PM, Andres Freund <andres@anarazel.de> wrote:
>> I don't quite understand the intended semantics of this proposed flag.
>
> Whenever the flag is set, we have to acquire the heavyweight tuple lock
> before continuing. That guarantees nobody else can modify the tuple
> while the lock is released, without requiring more than one hint bit to
> be modified. That should fix the torn page issue, no?

Yeah, I guess that would work.

>> If we don't already have the tuple lock at that point, we can't go
>> acquire it before releasing the content lock, can we?
>
> Why not? Afaics the way that tuple locks are used, the nesting should
> be fine.

Well, the existing places where we acquire the tuple lock within
heap_update() are all careful to release the page lock first, so I'm
skeptical that doing it in the other order is safe. Certainly, if we've
got some code that grabs the page lock and then the tuple lock and
other code that grabs the tuple lock and then the page lock, that's a
deadlock waiting to happen. I'm also a bit dubious that LockAcquire is
safe to call in general with interrupts held.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
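[The deadlock concern here is the classic lock-ordering problem. A minimal generic illustration follows - plain pthreads, not PostgreSQL code - of the rule that every code path must take the two locks in the same global order; if one path took page-then-tuple while another took tuple-then-page, the two could each hold one lock while waiting for the other.]

/*
 * Both workers honor one agreed order: "tuple" lock first, then
 * "page" lock. Reversing the order in only one of them would create
 * a deadlock opportunity. Compile with: cc demo.c -pthread
 */
#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t tuple_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t page_lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg)
{
    pthread_mutex_lock(&tuple_lock);   /* heavyweight tuple lock analogue */
    pthread_mutex_lock(&page_lock);    /* buffer content lock analogue */
    printf("thread %ld: got both locks in the agreed order\n", (long) arg);
    pthread_mutex_unlock(&page_lock);
    pthread_mutex_unlock(&tuple_lock);
    return NULL;
}

int main(void)
{
    pthread_t a, b;
    pthread_create(&a, NULL, worker, (void *) 1L);
    pthread_create(&b, NULL, worker, (void *) 2L);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    return 0;
}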
On 2016-06-21 13:03:24 -0400, Robert Haas wrote:
> On Tue, Jun 21, 2016 at 12:54 PM, Andres Freund <andres@anarazel.de> wrote:
> >> I don't quite understand the intended semantics of this proposed flag.
> >
> > Whenever the flag is set, we have to acquire the heavyweight tuple lock
> > before continuing. That guarantees nobody else can modify the tuple
> > while the lock is released, without requiring more than one hint bit to
> > be modified. That should fix the torn page issue, no?
>
> Yeah, I guess that would work.
>
> >> If we don't already have the tuple lock at that point, we can't go
> >> acquire it before releasing the content lock, can we?
> >
> > Why not? Afaics the way that tuple locks are used, the nesting should
> > be fine.
>
> Well, the existing places where we acquire the tuple lock within
> heap_update() are all careful to release the page lock first, so I'm
> skeptical that doing it in the other order is safe. Certainly, if we've
> got some code that grabs the page lock and then the tuple lock and
> other code that grabs the tuple lock and then the page lock, that's a
> deadlock waiting to happen.

Just noticed this piece of code while looking into this:

	UnlockReleaseBuffer(buffer);
	if (have_tuple_lock)
		UnlockTupleTuplock(relation, &(tp.t_self), LockTupleExclusive);
	if (vmbuffer != InvalidBuffer)
		ReleaseBuffer(vmbuffer);
	return result;

seems weird to release the vmbuffer after the tuple lock...

> I'm also a bit dubious that LockAcquire is safe to call in general
> with interrupts held.

Looks like we could just acquire the tuple lock *before* doing the
toast_insert_or_update/RelationGetBufferForTuple, but after releasing
the buffer lock. That'd allow us to avoid doing the nested locking, and
should make the recovery just a goto l2;, ...

Andres
On Tue, Jun 21, 2016 at 1:49 PM, Andres Freund <andres@anarazel.de> wrote:
>> I'm also a bit dubious that LockAcquire is safe to call in general
>> with interrupts held.
>
> Looks like we could just acquire the tuple lock *before* doing the
> toast_insert_or_update/RelationGetBufferForTuple, but after releasing
> the buffer lock. That'd allow us to avoid doing the nested locking, and
> should make the recovery just a goto l2;, ...

Why isn't that racey? Somebody else can grab the tuple lock after we
release the buffer content lock and before we acquire the tuple lock.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 2016-06-21 15:38:25 -0400, Robert Haas wrote:
> On Tue, Jun 21, 2016 at 1:49 PM, Andres Freund <andres@anarazel.de> wrote:
> >> I'm also a bit dubious that LockAcquire is safe to call in general
> >> with interrupts held.
> >
> > Looks like we could just acquire the tuple lock *before* doing the
> > toast_insert_or_update/RelationGetBufferForTuple, but after releasing
> > the buffer lock. That'd allow us to avoid doing the nested locking, and
> > should make the recovery just a goto l2;, ...
>
> Why isn't that racey? Somebody else can grab the tuple lock after we
> release the buffer content lock and before we acquire the tuple lock.

Sure, but by the time the tuple lock is released, they'd have updated
xmax. So once we've acquired that we can just do

	if (xmax_infomask_changed(oldtup.t_data->t_infomask, infomask) ||
		!TransactionIdEquals(HeapTupleHeaderGetRawXmax(oldtup.t_data),
							 xwait))
		goto l2;

which is fine, because we've not yet done the toasting.

I'm not sure whether this approach is better than deleting potentially
toasted data though. It's probably faster, but will likely touch more
places in the code, and eat up an infomask bit (infomask & HEAP_MOVED
== HEAP_MOVED in my prototype).

Andres
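[For readers unfamiliar with the trick mentioned at the end: HEAP_MOVED_OFF and HEAP_MOVED_IN are mutually exclusive on any legitimate tuple, so testing for *both* bits at once yields a "free" flag without consuming a fresh infomask bit. A minimal standalone sketch follows; the two bit values match src/include/access/htup_details.h, while the predicate name is hypothetical:]

#include <stdint.h>
#include <stdio.h>

#define HEAP_MOVED_OFF	0x4000	/* from htup_details.h */
#define HEAP_MOVED_IN	0x8000	/* from htup_details.h */
#define HEAP_MOVED		(HEAP_MOVED_OFF | HEAP_MOVED_IN)

/* Hypothetical: true only when both bits are set simultaneously. */
static int needs_heavyweight_lock(uint16_t infomask)
{
    return (infomask & HEAP_MOVED) == HEAP_MOVED;
}

int main(void)
{
    printf("%d\n", needs_heavyweight_lock(HEAP_MOVED_OFF)); /* 0 */
    printf("%d\n", needs_heavyweight_lock(HEAP_MOVED_IN));  /* 0 */
    printf("%d\n", needs_heavyweight_lock(HEAP_MOVED));     /* 1 */
    return 0;
}

[The downside, raised in the next message, is that it permanently burns the bit pair, which was reserved for old VACUUM FULL-era tuple states.]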
On Tue, Jun 21, 2016 at 3:46 PM, Andres Freund <andres@anarazel.de> wrote:
> On 2016-06-21 15:38:25 -0400, Robert Haas wrote:
>> On Tue, Jun 21, 2016 at 1:49 PM, Andres Freund <andres@anarazel.de> wrote:
>> >> I'm also a bit dubious that LockAcquire is safe to call in general
>> >> with interrupts held.
>> >
>> > Looks like we could just acquire the tuple lock *before* doing the
>> > toast_insert_or_update/RelationGetBufferForTuple, but after releasing
>> > the buffer lock. That'd allow us to avoid doing the nested locking, and
>> > should make the recovery just a goto l2;, ...
>>
>> Why isn't that racey? Somebody else can grab the tuple lock after we
>> release the buffer content lock and before we acquire the tuple lock.
>
> Sure, but by the time the tuple lock is released, they'd have updated
> xmax. So once we've acquired that we can just do
>
>	if (xmax_infomask_changed(oldtup.t_data->t_infomask, infomask) ||
>		!TransactionIdEquals(HeapTupleHeaderGetRawXmax(oldtup.t_data),
>							 xwait))
>		goto l2;
>
> which is fine, because we've not yet done the toasting.

I see.

> I'm not sure whether this approach is better than deleting potentially
> toasted data though. It's probably faster, but will likely touch more
> places in the code, and eat up an infomask bit (infomask & HEAP_MOVED
> == HEAP_MOVED in my prototype).

Ugh. That's not very desirable at all.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 2016-06-21 16:32:03 -0400, Robert Haas wrote:
> On Tue, Jun 21, 2016 at 3:46 PM, Andres Freund <andres@anarazel.de> wrote:
> > I'm not sure whether this approach is better than deleting potentially
> > toasted data though. It's probably faster, but will likely touch more
> > places in the code, and eat up an infomask bit (infomask & HEAP_MOVED
> > == HEAP_MOVED in my prototype).
>
> Ugh. That's not very desirable at all.

I'm looking into three approaches right now:

1) Flag approach from above
2) Undo toasting on concurrent activity, retry
3) Use WAL logging for the already_marked = true case.

1) primarily suffers from a significant amount of complexity. I still
have a bug in there that sometimes triggers "attempted to update
invisible tuple" ERRORs. Otherwise it seems to perform decently
performance-wise - even on workloads with many backends hitting the
same tuple, the retry rate is low.

2) Seems to work too, but due to the amount of time the tuple is not
locked, the retry rate can be really high. As we perform a significant
amount of work (toast insertion & index manipulation or extending a
file) while the tuple is not locked, it's quite likely that another
session tries to modify the tuple in between. I think it's possible to
essentially livelock.

3) This approach so far seems the best. It's possible to reuse the
xl_heap_lock record (in an afaics backwards compatible manner), and in
most cases the overhead isn't that large. It's of course annoying to
emit more WAL, but it's not that big an overhead compared to extending
a file, or to toasting. It's also by far the simplest fix.

Comments?
Andres Freund wrote:

> I'm looking into three approaches right now:
>
> 3) Use WAL logging for the already_marked = true case.

> 3) This approach so far seems the best. It's possible to reuse the
> xl_heap_lock record (in an afaics backwards compatible manner), and in
> most cases the overhead isn't that large. It's of course annoying to
> emit more WAL, but it's not that big an overhead compared to extending
> a file, or to toasting. It's also by far the simplest fix.

I suppose it's fine if we crash midway from emitting this wal record and
the actual heap_update one, since the xmax will appear to come from an
aborted xid, right?

I agree that the overhead is probably negligible, considering that this
only happens when toast is invoked. It's probably not as great when the
new tuple goes to another page, though.

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 2016-06-23 18:59:57 -0400, Alvaro Herrera wrote:
> Andres Freund wrote:
>
> > I'm looking into three approaches right now:
> >
> > 3) Use WAL logging for the already_marked = true case.
>
> > 3) This approach so far seems the best. It's possible to reuse the
> > xl_heap_lock record (in an afaics backwards compatible manner), and in
> > most cases the overhead isn't that large. It's of course annoying to
> > emit more WAL, but it's not that big an overhead compared to extending
> > a file, or to toasting. It's also by far the simplest fix.
>
> I suppose it's fine if we crash midway from emitting this wal record and
> the actual heap_update one, since the xmax will appear to come from an
> aborted xid, right?

Yea, that should be fine.

> I agree that the overhead is probably negligible, considering that this
> only happens when toast is invoked. It's probably not as great when the
> new tuple goes to another page, though.

I think it has to happen in both cases unfortunately. We could try to
add some optimizations (e.g. only release lock & WAL log if the target
page, via fsm, is before the current one), but I don't really want to go
there in the back branches.

Andres
On Fri, Jun 24, 2016 at 4:33 AM, Andres Freund <andres@anarazel.de> wrote:
> On 2016-06-23 18:59:57 -0400, Alvaro Herrera wrote:
>> Andres Freund wrote:
>>
>> > I'm looking into three approaches right now:
>> >
>> > 3) Use WAL logging for the already_marked = true case.
>>
>> > 3) This approach so far seems the best. It's possible to reuse the
>> > xl_heap_lock record (in an afaics backwards compatible manner), and in
>> > most cases the overhead isn't that large. It's of course annoying to
>> > emit more WAL, but it's not that big an overhead compared to extending
>> > a file, or to toasting. It's also by far the simplest fix.
>>

+1 for proceeding with Approach-3.

>> I suppose it's fine if we crash midway from emitting this wal record and
>> the actual heap_update one, since the xmax will appear to come from an
>> aborted xid, right?
>
> Yea, that should be fine.
>
>> I agree that the overhead is probably negligible, considering that this
>> only happens when toast is invoked. It's probably not as great when the
>> new tuple goes to another page, though.
>
> I think it has to happen in both cases unfortunately. We could try to
> add some optimizations (e.g. only release lock & WAL log if the target
> page, via fsm, is before the current one), but I don't really want to go
> there in the back branches.
>

You are right; I think we can try such an optimization in HEAD, and that
too only if we see a performance hit from adding this new WAL in
heap_update.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On Tue, Jun 21, 2016 at 10:59:25AM +1200, Thomas Munro wrote:
> On Fri, Jun 17, 2016 at 3:36 PM, Noah Misch <noah@leadboat.com> wrote:
> > I agree the non-atomic, unlogged change is a problem. A related threat
> > doesn't require a torn page:
> >
> >   AssignTransactionId() xid=123
> >   heapam.c:3931 HeapTupleHeaderSetXmax(oldtup.t_data, 123);
> >   some ERROR before heap_update() finishes
> >   rollback;  -- xid=123
> >   some backend flushes the modified page
> >   immediate shutdown
> >   AssignTransactionId() xid=123
> >   commit;  -- xid=123
> >
> > If nothing wrote an xlog record that witnesses xid 123, the cluster can
> > reassign it after recovery. The failed update is now considered a
> > successful update, and the row improperly becomes dead. That's important.
>
> I wonder if that was originally supposed to be handled with the
> HEAP_XMAX_UNLOGGED flag which was removed in 11919160. A comment in
> the heap WAL logging commit f2bfe8a2 said that tqual routines would
> see the HEAP_XMAX_UNLOGGED flag in the event of a crash before logging
> (though I'm not sure if the tqual routines ever actually did that).

HEAP_XMAX_UNLOGGED does appear to have originated in contemplation of
this same hazard. Looking at the three commits in "git log -S
HEAP_XMAX_UNLOGGED" (f2bfe8a b58c041 1191916), nothing ever completed
the implementation by testing for that flag.
On Tue, Jun 21, 2016 at 6:59 AM, Andres Freund <andres@anarazel.de> wrote:
> On 2016-06-20 17:55:19 -0400, Robert Haas wrote:
>> On Mon, Jun 20, 2016 at 4:24 PM, Andres Freund <andres@anarazel.de> wrote:
>> > On 2016-06-20 16:10:23 -0400, Robert Haas wrote:
>> >> What exactly is the point of all of that already_marked stuff?
>> >
>> > Preventing the old tuple from being locked/updated by another backend,
>> > while unlocking the buffer.
>> >
>> >> I mean, suppose we just don't do any of that before we go off to do
>> >> toast_insert_or_update and RelationGetBufferForTuple. Eventually,
>> >> when we reacquire the page lock, we might find that somebody else has
>> >> already updated the tuple, but couldn't that be handled by
>> >> (approximately) looping back up to l2 just as we do in several other
>> >> cases?
>> >
>> > We'd potentially have to undo a fair amount more work: the toasted data
>> > would have to be deleted and such, just to retry. Which isn't going to
>> > be super easy, because all of it will be happening with the current cid
>> > (we can't just increase CommandCounterIncrement() for correctness
>> > reasons).
>>
>> Why would we have to delete the TOAST data? AFAIUI, the tuple points
>> to the TOAST data, but not the other way around. So if we change our
>> mind about where to put the tuple, I don't think that requires
>> re-TOASTing.
>
> Consider what happens if we, after restarting at l2, notice that we
> can't actually insert, but return in the !HeapTupleMayBeUpdated
> branch. If the caller doesn't error out - and there are certainly
> callers doing that - we'd "leak" a toasted datum.

Sorry to interrupt, but I have a question about this case.
Is there a case where we go back to l2 after we have created the
toasted datum (i.e. called toast_insert_or_update)?
IIUC, after we store the toast datum we just insert the heap tuple and
log WAL (or error out for some reason).

Regards,

--
Masahiko Sawada
On Tue, Jun 28, 2016 at 8:06 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> On Tue, Jun 21, 2016 at 6:59 AM, Andres Freund <andres@anarazel.de> wrote:
>> On 2016-06-20 17:55:19 -0400, Robert Haas wrote:
>>> On Mon, Jun 20, 2016 at 4:24 PM, Andres Freund <andres@anarazel.de> wrote:
>>> > On 2016-06-20 16:10:23 -0400, Robert Haas wrote:
>>> >> What exactly is the point of all of that already_marked stuff?
>>> >
>>> > Preventing the old tuple from being locked/updated by another backend,
>>> > while unlocking the buffer.
>>> >
>>> >> I mean, suppose we just don't do any of that before we go off to do
>>> >> toast_insert_or_update and RelationGetBufferForTuple. Eventually,
>>> >> when we reacquire the page lock, we might find that somebody else has
>>> >> already updated the tuple, but couldn't that be handled by
>>> >> (approximately) looping back up to l2 just as we do in several other
>>> >> cases?
>>> >
>>> > We'd potentially have to undo a fair amount more work: the toasted data
>>> > would have to be deleted and such, just to retry. Which isn't going to
>>> > be super easy, because all of it will be happening with the current cid
>>> > (we can't just increase CommandCounterIncrement() for correctness
>>> > reasons).
>>>
>>> Why would we have to delete the TOAST data? AFAIUI, the tuple points
>>> to the TOAST data, but not the other way around. So if we change our
>>> mind about where to put the tuple, I don't think that requires
>>> re-TOASTing.
>>
>> Consider what happens if we, after restarting at l2, notice that we
>> can't actually insert, but return in the !HeapTupleMayBeUpdated
>> branch. If the caller doesn't error out - and there are certainly
>> callers doing that - we'd "leak" a toasted datum.
>
> Sorry to interrupt, but I have a question about this case.
> Is there a case where we go back to l2 after we have created the
> toasted datum (i.e. called toast_insert_or_update)?
> IIUC, after we store the toast datum we just insert the heap tuple and
> log WAL (or error out for some reason).
>

I understood now; sorry for the noise.

Regards,

--
Masahiko Sawada
On Fri, Jun 24, 2016 at 11:04 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Fri, Jun 24, 2016 at 4:33 AM, Andres Freund <andres@anarazel.de> wrote:
>> On 2016-06-23 18:59:57 -0400, Alvaro Herrera wrote:
>>> Andres Freund wrote:
>>>
>>> > I'm looking into three approaches right now:
>>> >
>>> > 3) Use WAL logging for the already_marked = true case.
>>>
>>> > 3) This approach so far seems the best. It's possible to reuse the
>>> > xl_heap_lock record (in an afaics backwards compatible manner), and in
>>> > most cases the overhead isn't that large. It's of course annoying to
>>> > emit more WAL, but it's not that big an overhead compared to extending
>>> > a file, or to toasting. It's also by far the simplest fix.
>>>
>
> +1 for proceeding with Approach-3.
>
>> I think it has to happen in both cases unfortunately. We could try to
>> add some optimizations (e.g. only release lock & WAL log if the target
>> page, via fsm, is before the current one), but I don't really want to go
>> there in the back branches.
>>
>
> You are right; I think we can try such an optimization in HEAD, and that
> too only if we see a performance hit from adding this new WAL in
> heap_update.
>

+1 for the #3 approach; draft patch attached.
I think the attached patch would fix this problem, but please let me
know if it is not what you were thinking of.

Regards,

--
Masahiko Sawada
On Wed, Jun 29, 2016 at 11:14 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> On Fri, Jun 24, 2016 at 11:04 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> On Fri, Jun 24, 2016 at 4:33 AM, Andres Freund <andres@anarazel.de> wrote:
>>> On 2016-06-23 18:59:57 -0400, Alvaro Herrera wrote:
>>>> Andres Freund wrote:
>>>>
>>>> > I'm looking into three approaches right now:
>>>> >
>>>> > 3) Use WAL logging for the already_marked = true case.
>>>>
>>>> > 3) This approach so far seems the best. It's possible to reuse the
>>>> > xl_heap_lock record (in an afaics backwards compatible manner), and in
>>>> > most cases the overhead isn't that large. It's of course annoying to
>>>> > emit more WAL, but it's not that big an overhead compared to extending
>>>> > a file, or to toasting. It's also by far the simplest fix.
>>>>
>>
>> You are right; I think we can try such an optimization in HEAD, and that
>> too only if we see a performance hit from adding this new WAL in
>> heap_update.
>>
>
> +1 for the #3 approach; draft patch attached.
> I think the attached patch would fix this problem, but please let me
> know if it is not what you were thinking of.

Review comments:

+	if (RelationNeedsWAL(relation))
+	{
+		xl_heap_lock xlrec;
+		XLogRecPtr	recptr;
+
..
+		xlrec.offnum = ItemPointerGetOffsetNumber(&oldtup.t_self);
+		xlrec.locking_xid = xid;
+		xlrec.infobits_set = compute_infobits(oldtup.t_data->t_infomask,
+											  oldtup.t_data->t_infomask2);
+		XLogRegisterData((char *) &xlrec, SizeOfHeapLock);
+		recptr = XLogInsert(RM_HEAP_ID, XLOG_HEAP_LOCK);
+		PageSetLSN(page, recptr);
+	}

There is nothing in this record which records the information about the
visibility clear flag. How will you ensure that the flag is cleared
after a crash?

Have you considered logging the cid using log_heap_new_cid() for
logical decoding?

It seems to me that the value of locking_xid should be xmax_old_tuple;
why have you chosen xid?

+	/* Celar PD_ALL_VISIBLE flags */
+	if (PageIsAllVisible(BufferGetPage(buffer)))
+	{
+		all_visible_cleared = true;
+		PageClearAllVisible(BufferGetPage(buffer));
+		visibilitymap_clear(relation, BufferGetBlockNumber(buffer),
+							vmbuffer);
+	}
+
+	MarkBufferDirty(buffer);
+
	/* Clear obsolete visibility flags ... */
	oldtup.t_data->t_infomask &= ~(HEAP_XMAX_BITS | HEAP_MOVED);

I think it is better to first update the tuple-related info and then
clear the PD_ALL_VISIBLE flags (for the order, refer to how we have done
it in heap_update in the code below where you are trying to add new
code).

Couple of typos -

/relasing/releasing
/Celar/Clear

I think in this approach, it is important to measure the performance of
update; maybe you can use the simple-update option of pgbench for
various workloads. Try it with different fill factors (-F fillfactor
option in pgbench).

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On 2016-06-29 19:04:31 +0530, Amit Kapila wrote:
> There is nothing in this record which records the information about the
> visibility clear flag.

I think we can actually defer the clearing to the lock release? A tuple
being locked doesn't require the vm being cleared.

> I think in this approach, it is important to measure the performance of
> update; maybe you can use the simple-update option of pgbench for
> various workloads. Try it with different fill factors (-F fillfactor
> option in pgbench).

Probably not sufficient; it also needs toast activity, to show the
really bad case of many lock releases.
On Wed, Jun 29, 2016 at 10:30 PM, Andres Freund <andres@anarazel.de> wrote:
> On 2016-06-29 19:04:31 +0530, Amit Kapila wrote:
>> There is nothing in this record which records the information about the
>> visibility clear flag.
>
> I think we can actually defer the clearing to the lock release?

How about the case where, after we release the lock on the page, the
heap page gets flushed, but not the vm, and then the server crashes?
After recovery, vacuum will never consider such a page for freezing, as
the vm bit still says all_frozen.

Another possibility could be that the WAL for xl_heap_lock got flushed,
but not the heap page, before the crash; after recovery, replay will set
the tuple with the appropriate infomask and other flags, but the heap
page will still be marked as ALL_VISIBLE. I think that can lead to the
situation which Thomas Munro has reported upthread.

In all other cases in heapam.c, after clearing the vm and the
corresponding flag in the heap page, we record the same in WAL. Why make
this a different case, and how is it safe to do it here but not at the
other places?

> A tuple
> being locked doesn't require the vm being cleared.
>
>> I think in this approach, it is important to measure the performance of
>> update; maybe you can use the simple-update option of pgbench for
>> various workloads. Try it with different fill factors (-F fillfactor
>> option in pgbench).
>
> Probably not sufficient; it also needs toast activity, to show the
> really bad case of many lock releases.

Okay, makes sense.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On 2016-06-30 08:59:16 +0530, Amit Kapila wrote:
> On Wed, Jun 29, 2016 at 10:30 PM, Andres Freund <andres@anarazel.de> wrote:
> > On 2016-06-29 19:04:31 +0530, Amit Kapila wrote:
> >> There is nothing in this record which records the information about the
> >> visibility clear flag.
> >
> > I think we can actually defer the clearing to the lock release?
>
> How about the case where, after we release the lock on the page, the
> heap page gets flushed, but not the vm, and then the server crashes?

In the released branches there's no need to clear all-visible at that
point. Note how heap_lock_tuple doesn't clear it at all. So we should be
fine there, and that's the part where reusing an existing record is
important (for compatibility).

But your question made me realize that we desperately *do* need to
clear the frozen bit in heap_lock_tuple in 9.6...

Greetings,

Andres Freund
On Thu, Jun 30, 2016 at 9:13 AM, Andres Freund <andres@anarazel.de> wrote:
> On 2016-06-30 08:59:16 +0530, Amit Kapila wrote:
>> On Wed, Jun 29, 2016 at 10:30 PM, Andres Freund <andres@anarazel.de> wrote:
>> > On 2016-06-29 19:04:31 +0530, Amit Kapila wrote:
>> >> There is nothing in this record which records the information about the
>> >> visibility clear flag.
>> >
>> > I think we can actually defer the clearing to the lock release?
>>
>> How about the case where, after we release the lock on the page, the
>> heap page gets flushed, but not the vm, and then the server crashes?
>
> In the released branches there's no need to clear all-visible at that
> point. Note how heap_lock_tuple doesn't clear it at all. So we should be
> fine there, and that's the part where reusing an existing record is
> important (for compatibility).
>

For back branches, I agree that heap_lock_tuple is sufficient, but in
that case we should not clear the vm or page bit at all, as is done in
the proposed patch.

> But your question made me realize that we desperately *do* need to
> clear the frozen bit in heap_lock_tuple in 9.6...
>

Right.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On Thu, Jun 30, 2016 at 3:12 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Thu, Jun 30, 2016 at 9:13 AM, Andres Freund <andres@anarazel.de> wrote:
>> On 2016-06-30 08:59:16 +0530, Amit Kapila wrote:
>>> On Wed, Jun 29, 2016 at 10:30 PM, Andres Freund <andres@anarazel.de> wrote:
>>> > On 2016-06-29 19:04:31 +0530, Amit Kapila wrote:
>>> >> There is nothing in this record which records the information about the
>>> >> visibility clear flag.
>>> >
>>> > I think we can actually defer the clearing to the lock release?
>>>
>>> How about the case where, after we release the lock on the page, the
>>> heap page gets flushed, but not the vm, and then the server crashes?
>>
>> In the released branches there's no need to clear all-visible at that
>> point. Note how heap_lock_tuple doesn't clear it at all. So we should be
>> fine there, and that's the part where reusing an existing record is
>> important (for compatibility).
>>
>
> For back branches, I agree that heap_lock_tuple is sufficient,

Even if we use heap_lock_tuple, if the server crashed after flushing the
heap but not the vm, then after crash recovery the page is still marked
all-visible in the vm.
This could happen even on released branches, and could make an
IndexOnlyScan return a wrong result?

Regards,

--
Masahiko Sawada
On Thu, Jun 30, 2016 at 8:10 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> On Thu, Jun 30, 2016 at 3:12 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> On Thu, Jun 30, 2016 at 9:13 AM, Andres Freund <andres@anarazel.de> wrote:
>>> On 2016-06-30 08:59:16 +0530, Amit Kapila wrote:
>>>> How about the case where, after we release the lock on the page, the
>>>> heap page gets flushed, but not the vm, and then the server crashes?
>>>
>>> In the released branches there's no need to clear all-visible at that
>>> point. Note how heap_lock_tuple doesn't clear it at all. So we should be
>>> fine there, and that's the part where reusing an existing record is
>>> important (for compatibility).
>>>
>>
>> For back branches, I agree that heap_lock_tuple is sufficient,
>
> Even if we use heap_lock_tuple, if the server crashed after flushing the
> heap but not the vm, then after crash recovery the page is still marked
> all-visible in the vm.

So, in this case both the vm and the page will be marked as all_visible.

> This could happen even on released branches, and could make an
> IndexOnlyScan return a wrong result?
>

Why do you think an IndexOnlyScan will return a wrong result? If the
server crashes in the way you described, the transaction that made the
modifications will anyway be considered aborted, so the result of the
IndexOnlyScan should not be wrong.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On Fri, Jul 1, 2016 at 11:12 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Thu, Jun 30, 2016 at 8:10 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>> Even if we use heap_lock_tuple, if the server crashed after flushing the
>> heap but not the vm, then after crash recovery the page is still marked
>> all-visible in the vm.
>
> So, in this case both the vm and the page will be marked as all_visible.
>
>> This could happen even on released branches, and could make an
>> IndexOnlyScan return a wrong result?
>>
>
> Why do you think an IndexOnlyScan will return a wrong result? If the
> server crashes in the way you described, the transaction that made the
> modifications will anyway be considered aborted, so the result of the
> IndexOnlyScan should not be wrong.
>

Ah, you're right, I misunderstood.

Attached is an updated patch incorporating your comments.
I've changed it so that heap_xlog_lock clears the vm flags if the page
is marked all-frozen.

Regards,

--
Masahiko Sawada
On Fri, Jul 1, 2016 at 10:22 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> Ah, you're right, I misunderstood.
>
> Attached is an updated patch incorporating your comments.
> I've changed it so that heap_xlog_lock clears the vm flags if the page
> is marked all-frozen.

I believe that this should be separated into two patches, since there
are two issues here:

1. Locking a tuple doesn't clear the all-frozen bit, but needs to do so.
2. heap_update releases the buffer content lock without logging the
   changes it has made.

With respect to #1, there is no need to clear the all-visible bit, only
the all-frozen bit. However, that's a bit tricky given that we removed
PD_ALL_FROZEN. Should we think about putting that back again? Should we
just clear all-visible and call it good enough? The only cost of that is
that vacuum will come along and mark the page all-visible again instead
of skipping it, but that's probably not an enormous expense in most
cases.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 2016-07-01 15:18:39 -0400, Robert Haas wrote:
> On Fri, Jul 1, 2016 at 10:22 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > Ah, you're right, I misunderstood.
> >
> > Attached is an updated patch incorporating your comments.
> > I've changed it so that heap_xlog_lock clears the vm flags if the page
> > is marked all-frozen.
>
> I believe that this should be separated into two patches, since there
> are two issues here:
>
> 1. Locking a tuple doesn't clear the all-frozen bit, but needs to do so.
> 2. heap_update releases the buffer content lock without logging the
>    changes it has made.
>
> With respect to #1, there is no need to clear the all-visible bit, only
> the all-frozen bit. However, that's a bit tricky given that we removed
> PD_ALL_FROZEN. Should we think about putting that back again?

I think it's fine to just do the vm lookup.

> Should we just clear all-visible and call it good enough?

Given that we need to do that in heap_lock_tuple, which entirely
preserves all-visible (but shouldn't preserve all-frozen), ISTM we
better find something that doesn't invalidate all-visible.

> The only
> cost of that is that vacuum will come along and mark the page
> all-visible again instead of skipping it, but that's probably not an
> enormous expense in most cases.

I think the main cost is not having the page marked as all-visible for
index-only purposes. If it's an insert-mostly table, it can be a long
while till vacuum comes around.

Andres
On 7/1/16 2:23 PM, Andres Freund wrote:
>> > The only
>> > cost of that is that vacuum will come along and mark the page
>> > all-visible again instead of skipping it, but that's probably not an
>> > enormous expense in most cases.
> I think the main cost is not having the page marked as all-visible for
> index-only purposes. If it's an insert-mostly table, it can be a long
> while till vacuum comes around.

ISTM that's something that should be addressed anyway (and separately), no?

--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com
855-TREBLE2 (855-873-2532)   mobile: 512-569-9461
On 2016-07-01 15:42:22 -0500, Jim Nasby wrote:
> On 7/1/16 2:23 PM, Andres Freund wrote:
> > I think the main cost is not having the page marked as all-visible for
> > index-only purposes. If it's an insert-mostly table, it can be a long
> > while till vacuum comes around.
>
> ISTM that's something that should be addressed anyway (and separately), no?

Huh? That's the current behaviour in heap_lock_tuple.
On 7/1/16 3:43 PM, Andres Freund wrote:
> On 2016-07-01 15:42:22 -0500, Jim Nasby wrote:
>> On 7/1/16 2:23 PM, Andres Freund wrote:
>>> I think the main cost is not having the page marked as all-visible for
>>> index-only purposes. If it's an insert-mostly table, it can be a long
>>> while till vacuum comes around.
>>
>> ISTM that's something that should be addressed anyway (and separately), no?
>
> Huh? That's the current behaviour in heap_lock_tuple.

Oh, I was referring to autovac not being aggressive enough on
insert-mostly tables. Certainly, if there's a reasonable way to avoid
invalidating the VM when locking a tuple, that'd be good.

--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com
855-TREBLE2 (855-873-2532)   mobile: 512-569-9461
On Sat, Jul 2, 2016 at 12:53 AM, Andres Freund <andres@anarazel.de> wrote:
> On 2016-07-01 15:18:39 -0400, Robert Haas wrote:
>> On Fri, Jul 1, 2016 at 10:22 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>> > Ah, you're right, I misunderstood.
>> >
>> > Attached is an updated patch incorporating your comments.
>> > I've changed it so that heap_xlog_lock clears the vm flags if the page
>> > is marked all-frozen.
>>
>> I believe that this should be separated into two patches, since there
>> are two issues here:
>>
>> 1. Locking a tuple doesn't clear the all-frozen bit, but needs to do so.
>> 2. heap_update releases the buffer content lock without logging the
>>    changes it has made.
>>
>> With respect to #1, there is no need to clear the all-visible bit, only
>> the all-frozen bit. However, that's a bit tricky given that we removed
>> PD_ALL_FROZEN. Should we think about putting that back again?
>
> I think it's fine to just do the vm lookup.
>
>> Should we just clear all-visible and call it good enough?
>
> Given that we need to do that in heap_lock_tuple, which entirely
> preserves all-visible (but shouldn't preserve all-frozen), ISTM we
> better find something that doesn't invalidate all-visible.
>

Sounds logical, considering that we have a way to set all-frozen and
vacuum does that as well. So probably either we need to have a new API
or add a new parameter to visibilitymap_clear() to indicate the same.
If we want to go that route, isn't it better to have PD_ALL_FROZEN as
well?

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
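[For reference, here is a standalone C sketch of what "clear only the all-frozen bit" means under the 9.6 two-bits-per-heap-block map layout. The constants and the HEAPBLK_TO_* macros mirror visibilitymap.h/visibilitymap.c; vm_clear_flags itself is a hypothetical stand-in for the flags-aware visibilitymap_clear() variant being discussed, operating on a raw byte array instead of a buffer page.]

#include <stdint.h>
#include <stdio.h>

#define BITS_PER_HEAPBLOCK        2
#define HEAPBLOCKS_PER_BYTE       4
#define VISIBILITYMAP_ALL_VISIBLE 0x01
#define VISIBILITYMAP_ALL_FROZEN  0x02

#define HEAPBLK_TO_MAPBYTE(x) ((x) / HEAPBLOCKS_PER_BYTE)
#define HEAPBLK_TO_MAPBIT(x)  (((x) % HEAPBLOCKS_PER_BYTE) * BITS_PER_HEAPBLOCK)

/* Hypothetical: clear only the requested flag bits for one heap block. */
static void
vm_clear_flags(uint8_t *map, uint32_t heapBlk, uint8_t flags)
{
    map[HEAPBLK_TO_MAPBYTE(heapBlk)] &= ~(flags << HEAPBLK_TO_MAPBIT(heapBlk));
}

int main(void)
{
    uint8_t  map[2] = {0};
    uint32_t blk = 5;

    /* Mark block 5 all-visible and all-frozen, as vacuum would. */
    map[HEAPBLK_TO_MAPBYTE(blk)] |=
        (VISIBILITYMAP_ALL_VISIBLE | VISIBILITYMAP_ALL_FROZEN) << HEAPBLK_TO_MAPBIT(blk);

    /* Lock-tuple path: drop only all-frozen, keep all-visible. */
    vm_clear_flags(map, blk, VISIBILITYMAP_ALL_FROZEN);

    printf("all-visible: %d, all-frozen: %d\n",
           !!(map[HEAPBLK_TO_MAPBYTE(blk)] & (VISIBILITYMAP_ALL_VISIBLE << HEAPBLK_TO_MAPBIT(blk))),
           !!(map[HEAPBLK_TO_MAPBYTE(blk)] & (VISIBILITYMAP_ALL_FROZEN << HEAPBLK_TO_MAPBIT(blk))));
    return 0;   /* prints: all-visible: 1, all-frozen: 0 */
}

[Because the two flags share a byte, a flags mask on the clear operation is all that a vm-only solution needs; a PD_ALL_FROZEN page bit would only be required if callers had to decide without consulting the map.]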
On Fri, Jul 1, 2016 at 7:52 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> On Fri, Jul 1, 2016 at 11:12 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>>
>> Why do you think an IndexOnlyScan will return a wrong result? If the
>> server crashes in the way you described, the transaction that made the
>> modifications will anyway be considered aborted, so the result of the
>> IndexOnlyScan should not be wrong.
>>
>
> Ah, you're right, I misunderstood.
>
> Attached is an updated patch incorporating your comments.
> I've changed it so that heap_xlog_lock clears the vm flags if the page
> is marked all-frozen.
>

I think we should make a similar change in the heap_lock_tuple API as
well.

Also, currently by default heap_xlog_lock tries to clear the visibility
flags; isn't it better to handle it as we do at all other places (ex.
see log_heap_update), by logging the information about the same? I think
it is always advisable to log every action we want replay to perform.
That way, it is always easy to extend it based on whether some change is
required only in certain cases, but not in others.

Though it is important to get the patch right, I feel in the meantime it
might be better to start benchmarking. AFAIU, even if we change some
part of the information while WAL logging it, the benchmark results
won't be much different.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On Sat, Jul 2, 2016 at 12:34 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Fri, Jul 1, 2016 at 7:52 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>> Ah, you're right, I misunderstood.
>>
>> Attached is an updated patch incorporating your comments.
>> I've changed it so that heap_xlog_lock clears the vm flags if the page
>> is marked all-frozen.
>>
>
> I think we should make a similar change in the heap_lock_tuple API as
> well.
> Also, currently by default heap_xlog_lock tries to clear the visibility
> flags; isn't it better to handle it as we do at all other places (ex.
> see log_heap_update), by logging the information about the same?

I will deal with them.

> Though it is important to get the patch right, I feel in the meantime it
> might be better to start benchmarking. AFAIU, even if we change some
> part of the information while WAL logging it, the benchmark results
> won't be much different.

Okay, I will do the benchmark test as well.

Regards,

--
Masahiko Sawada
On Sat, Jul 2, 2016 at 12:17 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Sat, Jul 2, 2016 at 12:53 AM, Andres Freund <andres@anarazel.de> wrote:
>> On 2016-07-01 15:18:39 -0400, Robert Haas wrote:
>>> I believe that this should be separated into two patches, since there
>>> are two issues here:
>>>
>>> 1. Locking a tuple doesn't clear the all-frozen bit, but needs to do so.
>>> 2. heap_update releases the buffer content lock without logging the
>>>    changes it has made.
>>>
>>> With respect to #1, there is no need to clear the all-visible bit, only
>>> the all-frozen bit. However, that's a bit tricky given that we removed
>>> PD_ALL_FROZEN. Should we think about putting that back again?
>>
>> I think it's fine to just do the vm lookup.
>>
>>> Should we just clear all-visible and call it good enough?
>>
>> Given that we need to do that in heap_lock_tuple, which entirely
>> preserves all-visible (but shouldn't preserve all-frozen), ISTM we
>> better find something that doesn't invalidate all-visible.
>>
>
> Sounds logical, considering that we have a way to set all-frozen and
> vacuum does that as well. So probably either we need to have a new API
> or add a new parameter to visibilitymap_clear() to indicate the same.
> If we want to go that route, isn't it better to have PD_ALL_FROZEN as
> well?
>

Can't we call visibilitymap_set with the all-visible but not the
all-frozen bit, instead of clearing flags?

Regards,

--
Masahiko Sawada
On Mon, Jul 4, 2016 at 2:31 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> On Sat, Jul 2, 2016 at 12:17 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> Sounds logical, considering that we have a way to set all-frozen and
>> vacuum does that as well. So probably either we need to have a new API
>> or add a new parameter to visibilitymap_clear() to indicate the same.
>> If we want to go that route, isn't it better to have PD_ALL_FROZEN as
>> well?
>>
>
> Can't we call visibilitymap_set with the all-visible but not the
> all-frozen bit, instead of clearing flags?
>

That doesn't sound like a good way to deal with it. First,
visibilitymap_set logs the action itself, which would generate two WAL
records (one for the visibility map and another for the lock tuple).
Second, it doesn't look consistent to me.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On Mon, Jul 4, 2016 at 5:44 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> On Sat, Jul 2, 2016 at 12:34 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> I think we should make a similar change in the heap_lock_tuple API as
>> well.
>> Also, currently by default heap_xlog_lock tries to clear the visibility
>> flags; isn't it better to handle it as we do at all other places (ex.
>> see log_heap_update), by logging the information about the same?
>
> I will deal with them.
>
>> Though it is important to get the patch right, I feel in the meantime it
>> might be better to start benchmarking. AFAIU, even if we change some
>> part of the information while WAL logging it, the benchmark results
>> won't be much different.
>
> Okay, I will do the benchmark test as well.
>

I measured the throughput and the output quantity of WAL with HEAD and
HEAD+patch (attached) on my virtual environment.
I used pgbench with the attached custom script file, which sets a
3200-character string in the filler column in order to generate toast
data. The scale factor is 1000 and the pgbench options are -c 4 -T 600
-f toast_test.sql. I changed the filler column to the text data type
before running it.

* Throughput
HEAD    : 1833.204172
Patched : 1827.399482

* Output quantity of WAL
HEAD    : 7771 MB
Patched : 8082 MB

The throughput is almost the same, but the output quantity of WAL
increased slightly (about 4%).

Regards,

--
Masahiko Sawada
On 2016-07-05 23:37:59 +0900, Masahiko Sawada wrote: > diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c > index 57da57a..fd66527 100644 > --- a/src/backend/access/heap/heapam.c > +++ b/src/backend/access/heap/heapam.c > @@ -3923,6 +3923,17 @@ l2: > > if (need_toast || newtupsize > pagefree) > { > + /* > + * To prevent data corruption due to updating old tuple by > + * other backends after released buffer That's not really the reason, is it? The prime problem is crash safety / replication. The row-lock we're faking (by setting xmax to our xid), prevents concurrent updates until this backend died. > , we need to emit that > + * xmax of old tuple is set and clear visibility map bits if > + * needed before releasing buffer. We can reuse xl_heap_lock > + * for this purpose. It should be fine even if we crash midway > + * from this section and the actual updating one later, since > + * the xmax will appear to come from an aborted xid. > + */ > + START_CRIT_SECTION(); > + > /* Clear obsolete visibility flags ... */ > oldtup.t_data->t_infomask &= ~(HEAP_XMAX_BITS | HEAP_MOVED); > oldtup.t_data->t_infomask2 &= ~HEAP_KEYS_UPDATED; > @@ -3936,6 +3947,46 @@ l2: > /* temporarily make it look not-updated */ > oldtup.t_data->t_ctid = oldtup.t_self; > already_marked = true; > + > + /* Clear PD_ALL_VISIBLE flags */ > + if (PageIsAllVisible(BufferGetPage(buffer))) > + { > + all_visible_cleared = true; > + PageClearAllVisible(BufferGetPage(buffer)); > + visibilitymap_clear(relation, BufferGetBlockNumber(buffer), > + vmbuffer); > + } > + > + MarkBufferDirty(buffer); > + > + if (RelationNeedsWAL(relation)) > + { > + xl_heap_lock xlrec; > + XLogRecPtr recptr; > + > + /* > + * For logical decoding we need combocids to properly decode the > + * catalog. > + */ > + if (RelationIsAccessibleInLogicalDecoding(relation)) > + log_heap_new_cid(relation, &oldtup); Hm, I don't see that being necessary here. Row locks aren't logically decoded, so there's no need to emit this here. > + /* Clear PD_ALL_VISIBLE flags */ > + if (PageIsAllVisible(page)) > + { > + Buffer vmbuffer = InvalidBuffer; > + BlockNumber block = BufferGetBlockNumber(*buffer); > + > + all_visible_cleared = true; > + PageClearAllVisible(page); > + visibilitymap_pin(relation, block, &vmbuffer); > + visibilitymap_clear(relation, block, vmbuffer); > + } > + That clears all-visible unnecessarily, we only need to clear all-frozen. > @@ -8694,6 +8761,23 @@ heap_xlog_lock(XLogReaderState *record) > } > HeapTupleHeaderSetXmax(htup, xlrec->locking_xid); > HeapTupleHeaderSetCmax(htup, FirstCommandId, false); > + > + /* The visibility map need to be cleared */ > + if ((xlrec->infobits_set & XLHL_ALL_VISIBLE_CLEARED) != 0) > + { > + RelFileNode rnode; > + Buffer vmbuffer = InvalidBuffer; > + BlockNumber blkno; > + Relation reln; > + > + XLogRecGetBlockTag(record, 0, &rnode, NULL, &blkno); > + reln = CreateFakeRelcacheEntry(rnode); > + > + visibilitymap_pin(reln, blkno, &vmbuffer); > + visibilitymap_clear(reln, blkno, vmbuffer); > + PageClearAllVisible(page); > + } > + > PageSetLSN(page, lsn); > MarkBufferDirty(buffer); > } > diff --git a/src/include/access/heapam_xlog.h b/src/include/access/heapam_xlog.h > index a822d0b..41b3c54 100644 > --- a/src/include/access/heapam_xlog.h > +++ b/src/include/access/heapam_xlog.h > @@ -242,6 +242,7 @@ typedef struct xl_heap_cleanup_info > #define XLHL_XMAX_EXCL_LOCK 0x04 > #define XLHL_XMAX_KEYSHR_LOCK 0x08 > #define XLHL_KEYS_UPDATED 0x10 > +#define XLHL_ALL_VISIBLE_CLEARED 0x20 Hm. 
We can't easily do that in the back-patched version, because a standby won't know to check for the flag. That's kinda ok, since we don't need to clear all-visible yet at that point of heap_update. But that means we'd better not do so on the master either. Greetings, Andres Freund
On Thu, Jul 7, 2016 at 3:36 AM, Andres Freund <andres@anarazel.de> wrote: > On 2016-07-05 23:37:59 +0900, Masahiko Sawada wrote: > >> @@ -8694,6 +8761,23 @@ heap_xlog_lock(XLogReaderState *record) >> } >> HeapTupleHeaderSetXmax(htup, xlrec->locking_xid); >> HeapTupleHeaderSetCmax(htup, FirstCommandId, false); >> + >> + /* The visibility map need to be cleared */ >> + if ((xlrec->infobits_set & XLHL_ALL_VISIBLE_CLEARED) != 0) >> + { >> + RelFileNode rnode; >> + Buffer vmbuffer = InvalidBuffer; >> + BlockNumber blkno; >> + Relation reln; >> + >> + XLogRecGetBlockTag(record, 0, &rnode, NULL, &blkno); >> + reln = CreateFakeRelcacheEntry(rnode); >> + >> + visibilitymap_pin(reln, blkno, &vmbuffer); >> + visibilitymap_clear(reln, blkno, vmbuffer); >> + PageClearAllVisible(page); >> + } >> + > > >> PageSetLSN(page, lsn); >> MarkBufferDirty(buffer); >> } >> diff --git a/src/include/access/heapam_xlog.h b/src/include/access/heapam_xlog.h >> index a822d0b..41b3c54 100644 >> --- a/src/include/access/heapam_xlog.h >> +++ b/src/include/access/heapam_xlog.h >> @@ -242,6 +242,7 @@ typedef struct xl_heap_cleanup_info >> #define XLHL_XMAX_EXCL_LOCK 0x04 >> #define XLHL_XMAX_KEYSHR_LOCK 0x08 >> #define XLHL_KEYS_UPDATED 0x10 >> +#define XLHL_ALL_VISIBLE_CLEARED 0x20 > > Hm. We can't easily do that in the back-patched version; because a > standby won't know to check for the flag . That's kinda ok, since we > don't yet need to clear all-visible yet at that point of > heap_update. But that better means we don't do so on the master either. > To clarify, do you mean to say that we should have XLHL_ALL_FROZEN_CLEARED and do that just for master, and that for back-branches there is no need to clear any visibility flags? -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Thank you for reviewing! On Thu, Jul 7, 2016 at 7:06 AM, Andres Freund <andres@anarazel.de> wrote: > On 2016-07-05 23:37:59 +0900, Masahiko Sawada wrote: >> diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c >> index 57da57a..fd66527 100644 >> --- a/src/backend/access/heap/heapam.c >> +++ b/src/backend/access/heap/heapam.c >> @@ -3923,6 +3923,17 @@ l2: >> >> if (need_toast || newtupsize > pagefree) >> { >> + /* >> + * To prevent data corruption due to updating old tuple by >> + * other backends after released buffer > > That's not really the reason, is it? The prime problem is crash safety / > replication. The row-lock we're faking (by setting xmax to our xid), > prevents concurrent updates until this backend died. Fixed. >> , we need to emit that >> + * xmax of old tuple is set and clear visibility map bits if >> + * needed before releasing buffer. We can reuse xl_heap_lock >> + * for this purpose. It should be fine even if we crash midway >> + * from this section and the actual updating one later, since >> + * the xmax will appear to come from an aborted xid. >> + */ >> + START_CRIT_SECTION(); >> + >> /* Clear obsolete visibility flags ... */ >> oldtup.t_data->t_infomask &= ~(HEAP_XMAX_BITS | HEAP_MOVED); >> oldtup.t_data->t_infomask2 &= ~HEAP_KEYS_UPDATED; >> @@ -3936,6 +3947,46 @@ l2: >> /* temporarily make it look not-updated */ >> oldtup.t_data->t_ctid = oldtup.t_self; >> already_marked = true; >> + >> + /* Clear PD_ALL_VISIBLE flags */ >> + if (PageIsAllVisible(BufferGetPage(buffer))) >> + { >> + all_visible_cleared = true; >> + PageClearAllVisible(BufferGetPage(buffer)); >> + visibilitymap_clear(relation, BufferGetBlockNumber(buffer), >> + vmbuffer); >> + } >> + >> + MarkBufferDirty(buffer); >> + >> + if (RelationNeedsWAL(relation)) >> + { >> + xl_heap_lock xlrec; >> + XLogRecPtr recptr; >> + >> + /* >> + * For logical decoding we need combocids to properly decode the >> + * catalog. >> + */ >> + if (RelationIsAccessibleInLogicalDecoding(relation)) >> + log_heap_new_cid(relation, &oldtup); > > Hm, I don't see that being necessary here. Row locks aren't logically > decoded, so there's no need to emit this here. Fixed. > >> + /* Clear PD_ALL_VISIBLE flags */ >> + if (PageIsAllVisible(page)) >> + { >> + Buffer vmbuffer = InvalidBuffer; >> + BlockNumber block = BufferGetBlockNumber(*buffer); >> + >> + all_visible_cleared = true; >> + PageClearAllVisible(page); >> + visibilitymap_pin(relation, block, &vmbuffer); >> + visibilitymap_clear(relation, block, vmbuffer); >> + } >> + > > That clears all-visible unnecessarily, we only need to clear all-frozen. > Fixed.
> >> @@ -8694,6 +8761,23 @@ heap_xlog_lock(XLogReaderState *record) >> } >> HeapTupleHeaderSetXmax(htup, xlrec->locking_xid); >> HeapTupleHeaderSetCmax(htup, FirstCommandId, false); >> + >> + /* The visibility map need to be cleared */ >> + if ((xlrec->infobits_set & XLHL_ALL_VISIBLE_CLEARED) != 0) >> + { >> + RelFileNode rnode; >> + Buffer vmbuffer = InvalidBuffer; >> + BlockNumber blkno; >> + Relation reln; >> + >> + XLogRecGetBlockTag(record, 0, &rnode, NULL, &blkno); >> + reln = CreateFakeRelcacheEntry(rnode); >> + >> + visibilitymap_pin(reln, blkno, &vmbuffer); >> + visibilitymap_clear(reln, blkno, vmbuffer); >> + PageClearAllVisible(page); >> + } >> + > > >> PageSetLSN(page, lsn); >> MarkBufferDirty(buffer); >> } >> diff --git a/src/include/access/heapam_xlog.h b/src/include/access/heapam_xlog.h >> index a822d0b..41b3c54 100644 >> --- a/src/include/access/heapam_xlog.h >> +++ b/src/include/access/heapam_xlog.h >> @@ -242,6 +242,7 @@ typedef struct xl_heap_cleanup_info >> #define XLHL_XMAX_EXCL_LOCK 0x04 >> #define XLHL_XMAX_KEYSHR_LOCK 0x08 >> #define XLHL_KEYS_UPDATED 0x10 >> +#define XLHL_ALL_VISIBLE_CLEARED 0x20 > > Hm. We can't easily do that in the back-patched version; because a > standby won't know to check for the flag . That's kinda ok, since we > don't yet need to clear all-visible yet at that point of > heap_update. But that better means we don't do so on the master either. > Attached is the latest version of the patch. I changed the visibilitymap_clear function so that it allows specifying the bits to be cleared. A function that needs to clear only the all-frozen bit on the visibility map calls the visibilitymap_clear_extended function to clear that particular bit; other functions can call visibilitymap_clear to clear all bits for one page. Instead of adding XLHL_ALL_VISIBLE_CLEARED, we do a visibility map lookup for back branches. To reduce unnecessary visibility map lookups, I changed it so that we check PD_ALL_VISIBLE on the heap page first, and then look up the all-frozen bit on the visibility map if necessary. Regards, -- Masahiko Sawada
Attachment
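A minimal sketch of the API split described above, with illustrative signatures (guessed from the description and the existing VISIBILITYMAP_* flag bits; the attached patch may differ in detail):

    /* Sketch: clear only the given VISIBILITYMAP_* bits for one heap page. */
    extern void visibilitymap_clear_extended(Relation rel, BlockNumber heapBlk,
                                             Buffer vmbuf, uint8 flags);

    /* Existing callers keep clearing both bits for the page. */
    static inline void
    visibilitymap_clear(Relation rel, BlockNumber heapBlk, Buffer vmbuf)
    {
        visibilitymap_clear_extended(rel, heapBlk, vmbuf,
                                     VISIBILITYMAP_VALID_BITS);
    }

    /* A caller that only wants to drop the all-frozen bit would then do: */
    /* visibilitymap_clear_extended(rel, blk, vmbuf, VISIBILITYMAP_ALL_FROZEN); */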
On Wed, Jul 6, 2016 at 6:06 PM, Andres Freund <andres@anarazel.de> wrote: > Hm. We can't easily do that in the back-patched version; because a > standby won't know to check for the flag . That's kinda ok, since we > don't yet need to clear all-visible yet at that point of > heap_update. But that better means we don't do so on the master either. Is there any reason to back-patch this in the first place? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Robert Haas wrote: > On Wed, Jul 6, 2016 at 6:06 PM, Andres Freund <andres@anarazel.de> wrote: > > Hm. We can't easily do that in the back-patched version; because a > > standby won't know to check for the flag . That's kinda ok, since we > > don't yet need to clear all-visible yet at that point of > > heap_update. But that better means we don't do so on the master either. > > Is there any reason to back-patch this in the first place? Wasn't this determined to be a pre-existing bug? I think the probability of occurrence has increased, but it's still possible in earlier releases. I wonder if there are unexplained bugs that can be traced down to this. I'm not really following this (sorry about that) but I wonder if (in back branches) the failure to propagate in case the standby wasn't updated can cause actual problems. If it does, maybe it'd be a better idea to have a new WAL record type instead of piggybacking on lock tuple. Then again, apparently the probability of this bug is low enough that we shouldn't sweat over it ... More so considering Robert's apparent opinion that perhaps we shouldn't backpatch at all in the first place. In any case I would like to see much more commentary in the patch next to the new XLHL flag. It's sufficiently different from the rest that it deserves it, IMO. -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Thu, Jul 7, 2016 at 10:53 AM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote: > Robert Haas wrote: >> On Wed, Jul 6, 2016 at 6:06 PM, Andres Freund <andres@anarazel.de> wrote: >> > Hm. We can't easily do that in the back-patched version; because a >> > standby won't know to check for the flag . That's kinda ok, since we >> > don't yet need to clear all-visible yet at that point of >> > heap_update. But that better means we don't do so on the master either. >> >> Is there any reason to back-patch this in the first place? > > Wasn't this determined to be a pre-existing bug? I think the > probability of occurrence has increased, but it's still possible in > earlier releases. I wonder if there are unexplained bugs that can be > traced down to this. > > I'm not really following this (sorry about that) but I wonder if (in > back branches) the failure to propagate in case the standby wasn't > updated can cause actual problems. If it does, maybe it'd be a better > idea to have a new WAL record type instead of piggybacking on lock > tuple. Then again, apparently the probability of this bug is low enough > that we shouldn't sweat over it ... Moreso considering Robert's apparent > opinion that perhaps we shouldn't backpatch at all in the first place. > > In any case I would like to see much more commentary in the patch next > to the new XLHL flag. It's sufficiently different than the rest than it > deserves so, IMO. There are two issues being discussed on this thread. One of them is a new issue in 9.6: heap_lock_tuple needs to clear the all-frozen bit in the freeze map even though it does not clear all-visible. The one that's actually a preexisting bug is that we can start to update a tuple without WAL-logging anything and then release the page lock in order to go perform TOAST insertions. At this point, other backends (on the master) will see this tuple as in the process of being updated because xmax has been set and ctid has been made to point back to the same tuple. I'm guessing that if the UPDATE goes on to complete, any discrepancy between the master and the standby is erased by the replay of the WAL record covering the update itself. I haven't checked that, but it seems like that WAL record must set both xmax and ctid appropriately or we'd be in big trouble. The infomask bits are in play too, but presumably the update's WAL is going to set those correctly also. So in this case I don't think there's really any issue for the standby. Or for the master, either: it may technically be true the tuple is not all-visible any more, but the only backend that could potentially fail to see it is the one performing the update, and no user code can run in the middle of toast_insert_or_update, so I think we're OK. On the other hand, if the UPDATE aborts, there's now a persistent difference between the master and standby: the infomask, xmax, and ctid of the tuple may differ. I don't know whether that could cause any problem. It's probably a very rare case, because there aren't all that many things that will cause us to abort in the middle of inserting TOAST tuples. Out of disk space comes to mind, or maybe some kind of corruption that throws an elog(). As far as back-patching goes, the question is whether it's worth the risk. Introducing new WAL logging at this point could certainly cause performance problems if nothing else, never mind the risk of garden-variety bugs. 
I'm not sure it's worth taking that risk in released branches for the sake of a bug which has existed for a decade without anybody finding it until now. I'm not going to argue strongly for that position, but I think it's worth thinking about. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 2016-07-07 10:37:15 -0400, Robert Haas wrote: > On Wed, Jul 6, 2016 at 6:06 PM, Andres Freund <andres@anarazel.de> wrote: > > Hm. We can't easily do that in the back-patched version; because a > > standby won't know to check for the flag . That's kinda ok, since we > > don't yet need to clear all-visible yet at that point of > > heap_update. But that better means we don't do so on the master either. > > Is there any reason to back-patch this in the first place? It seems not unlikely that this has caused corruption in the past, and that we chalked it up to hardware corruption or something. Both toasting and file extension frequently take extended amounts of time under load; the window for crashing at the wrong moment isn't small... Andres
On Thu, Jul 7, 2016 at 1:58 PM, Andres Freund <andres@anarazel.de> wrote: > On 2016-07-07 10:37:15 -0400, Robert Haas wrote: >> On Wed, Jul 6, 2016 at 6:06 PM, Andres Freund <andres@anarazel.de> wrote: >> > Hm. We can't easily do that in the back-patched version; because a >> > standby won't know to check for the flag . That's kinda ok, since we >> > don't yet need to clear all-visible yet at that point of >> > heap_update. But that better means we don't do so on the master either. >> >> Is there any reason to back-patch this in the first place? > > It seems not unlikely that this has caused corruption in the past; and > that we chalked it up to hardware corruption or something. Both toasting > and file extension frequently take extended amounts of time under load, > the window for crashing in the wrong moment isn't small... Yeah, that's true, but I'm having a bit of trouble imagining exactly how we end up with corruption that actually matters. I guess a torn page could do it. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 2016-07-07 14:01:05 -0400, Robert Haas wrote: > On Thu, Jul 7, 2016 at 1:58 PM, Andres Freund <andres@anarazel.de> wrote: > > On 2016-07-07 10:37:15 -0400, Robert Haas wrote: > >> On Wed, Jul 6, 2016 at 6:06 PM, Andres Freund <andres@anarazel.de> wrote: > >> > Hm. We can't easily do that in the back-patched version; because a > >> > standby won't know to check for the flag . That's kinda ok, since we > >> > don't yet need to clear all-visible yet at that point of > >> > heap_update. But that better means we don't do so on the master either. > >> > >> Is there any reason to back-patch this in the first place? > > > > It seems not unlikely that this has caused corruption in the past; and > > that we chalked it up to hardware corruption or something. Both toasting > > and file extension frequently take extended amounts of time under load, > > the window for crashing in the wrong moment isn't small... > > Yeah, that's true, but I'm having a bit of trouble imagining exactly > we end up with corruption that actually matters. I guess a torn page > could do it. I think Noah pointed out a bad scenario: If we crash after putting the xid in the page header, but before WAL logging, the xid could get reused after the crash. By a different transaction. And suddenly the row isn't visible anymore, after the reused xid commits...
On Thu, Jul 7, 2016 at 2:04 PM, Andres Freund <andres@anarazel.de> wrote: >> Yeah, that's true, but I'm having a bit of trouble imagining exactly >> we end up with corruption that actually matters. I guess a torn page >> could do it. > > I think Noah pointed out a bad scenario: If we crash after putting the > xid in the page header, but before WAL logging, the xid could get reused > after the crash. By a different transaction. And suddenly the row isn't > visible anymore, after the reused xid commits... Oh, wow. Yikes. OK, so I guess we should try to back-patch the fix, then. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Thu, Jul 7, 2016 at 12:34 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote: > Than you for reviewing! > > On Thu, Jul 7, 2016 at 7:06 AM, Andres Freund <andres@anarazel.de> wrote: >> On 2016-07-05 23:37:59 +0900, Masahiko Sawada wrote: >>> diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c >>> index 57da57a..fd66527 100644 >>> --- a/src/backend/access/heap/heapam.c >>> +++ b/src/backend/access/heap/heapam.c >>> @@ -3923,6 +3923,17 @@ l2: >>> >>> if (need_toast || newtupsize > pagefree) >>> { >>> + /* >>> + * To prevent data corruption due to updating old tuple by >>> + * other backends after released buffer >> >> That's not really the reason, is it? The prime problem is crash safety / >> replication. The row-lock we're faking (by setting xmax to our xid), >> prevents concurrent updates until this backend died. > > Fixed. > >>> , we need to emit that >>> + * xmax of old tuple is set and clear visibility map bits if >>> + * needed before releasing buffer. We can reuse xl_heap_lock >>> + * for this purpose. It should be fine even if we crash midway >>> + * from this section and the actual updating one later, since >>> + * the xmax will appear to come from an aborted xid. >>> + */ >>> + START_CRIT_SECTION(); >>> + >>> /* Clear obsolete visibility flags ... */ >>> oldtup.t_data->t_infomask &= ~(HEAP_XMAX_BITS | HEAP_MOVED); >>> oldtup.t_data->t_infomask2 &= ~HEAP_KEYS_UPDATED; >>> @@ -3936,6 +3947,46 @@ l2: >>> /* temporarily make it look not-updated */ >>> oldtup.t_data->t_ctid = oldtup.t_self; >>> already_marked = true; >>> + >>> + /* Clear PD_ALL_VISIBLE flags */ >>> + if (PageIsAllVisible(BufferGetPage(buffer))) >>> + { >>> + all_visible_cleared = true; >>> + PageClearAllVisible(BufferGetPage(buffer)); >>> + visibilitymap_clear(relation, BufferGetBlockNumber(buffer), >>> + vmbuffer); >>> + } >>> + >>> + MarkBufferDirty(buffer); >>> + >>> + if (RelationNeedsWAL(relation)) >>> + { >>> + xl_heap_lock xlrec; >>> + XLogRecPtr recptr; >>> + >>> + /* >>> + * For logical decoding we need combocids to properly decode the >>> + * catalog. >>> + */ >>> + if (RelationIsAccessibleInLogicalDecoding(relation)) >>> + log_heap_new_cid(relation, &oldtup); >> >> Hm, I don't see that being necessary here. Row locks aren't logically >> decoded, so there's no need to emit this here. > > Fixed. > >> >>> + /* Clear PD_ALL_VISIBLE flags */ >>> + if (PageIsAllVisible(page)) >>> + { >>> + Buffer vmbuffer = InvalidBuffer; >>> + BlockNumber block = BufferGetBlockNumber(*buffer); >>> + >>> + all_visible_cleared = true; >>> + PageClearAllVisible(page); >>> + visibilitymap_pin(relation, block, &vmbuffer); >>> + visibilitymap_clear(relation, block, vmbuffer); >>> + } >>> + >> >> That clears all-visible unnecessarily, we only need to clear all-frozen. >> > > Fixed. 
> >> >>> @@ -8694,6 +8761,23 @@ heap_xlog_lock(XLogReaderState *record) >>> } >>> HeapTupleHeaderSetXmax(htup, xlrec->locking_xid); >>> HeapTupleHeaderSetCmax(htup, FirstCommandId, false); >>> + >>> + /* The visibility map need to be cleared */ >>> + if ((xlrec->infobits_set & XLHL_ALL_VISIBLE_CLEARED) != 0) >>> + { >>> + RelFileNode rnode; >>> + Buffer vmbuffer = InvalidBuffer; >>> + BlockNumber blkno; >>> + Relation reln; >>> + >>> + XLogRecGetBlockTag(record, 0, &rnode, NULL, &blkno); >>> + reln = CreateFakeRelcacheEntry(rnode); >>> + >>> + visibilitymap_pin(reln, blkno, &vmbuffer); >>> + visibilitymap_clear(reln, blkno, vmbuffer); >>> + PageClearAllVisible(page); >>> + } >>> + >> >> >>> PageSetLSN(page, lsn); >>> MarkBufferDirty(buffer); >>> } >>> diff --git a/src/include/access/heapam_xlog.h b/src/include/access/heapam_xlog.h >>> index a822d0b..41b3c54 100644 >>> --- a/src/include/access/heapam_xlog.h >>> +++ b/src/include/access/heapam_xlog.h >>> @@ -242,6 +242,7 @@ typedef struct xl_heap_cleanup_info >>> #define XLHL_XMAX_EXCL_LOCK 0x04 >>> #define XLHL_XMAX_KEYSHR_LOCK 0x08 >>> #define XLHL_KEYS_UPDATED 0x10 >>> +#define XLHL_ALL_VISIBLE_CLEARED 0x20 >> >> Hm. We can't easily do that in the back-patched version; because a >> standby won't know to check for the flag . That's kinda ok, since we >> don't yet need to clear all-visible yet at that point of >> heap_update. But that better means we don't do so on the master either. >> > > Attached latest version patch. + /* Clear only the all-frozen bit on visibility map if needed */ + if (PageIsAllVisible(BufferGetPage(buffer)) && + VM_ALL_FROZEN(relation, block, &vmbuffer)) + { + visibilitymap_clear_extended(relation, block, vmbuffer, + VISIBILITYMAP_ALL_FROZEN); + } + + if (RelationNeedsWAL(relation)) + { .. + XLogBeginInsert(); + XLogRegisterBuffer(0, buffer, REGBUF_STANDARD); + + xlrec.offnum = ItemPointerGetOffsetNumber(&oldtup.t_self); + xlrec.locking_xid = xmax_old_tuple; + xlrec.infobits_set = compute_infobits(oldtup.t_data->t_infomask, + oldtup.t_data->t_infomask2); + XLogRegisterData((char *) &xlrec, SizeOfHeapLock); + recptr = XLogInsert(RM_HEAP_ID, XLOG_HEAP_LOCK); .. One thing that looks awkward in this code is that it doesn't record whether the frozen bit was actually cleared during the actual operation; then, during replay, it always clears the frozen bit, irrespective of whether it was cleared by the actual operation or not. + /* Clear only the all-frozen bit on visibility map if needed */ + if (PageIsAllVisible(page) && + VM_ALL_FROZEN(relation, BufferGetBlockNumber(*buffer), &vmbuffer)) + { + BlockNumber block = BufferGetBlockNumber(*buffer); + + visibilitymap_pin(relation, block, &vmbuffer); I think it is not right to call visibilitymap_pin after holding a buffer lock (visibilitymap_pin can perform I/O). Refer to heap_update for how to pin the visibility map. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
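For reference, the heap_update() pattern Amit points to is roughly the following (paraphrased from the 9.6-era sources): pin first, lock second, so the possible I/O in visibilitymap_pin() never happens while the heap buffer content lock is held.

    buffer = ReadBuffer(relation, block);
    page = BufferGetPage(buffer);

    /*
     * Pin the visibility map page before taking the buffer content lock;
     * visibilitymap_pin() may need to read the vm page in from disk.
     */
    if (PageIsAllVisible(page))
        visibilitymap_pin(relation, block, &vmbuffer);

    LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);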
On Fri, Jul 8, 2016 at 10:24 PM, Amit Kapila <amit.kapila16@gmail.com> wrote: > On Thu, Jul 7, 2016 at 12:34 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote: >> Than you for reviewing! >> >> On Thu, Jul 7, 2016 at 7:06 AM, Andres Freund <andres@anarazel.de> wrote: >>> On 2016-07-05 23:37:59 +0900, Masahiko Sawada wrote: >>>> diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c >>>> index 57da57a..fd66527 100644 >>>> --- a/src/backend/access/heap/heapam.c >>>> +++ b/src/backend/access/heap/heapam.c >>>> @@ -3923,6 +3923,17 @@ l2: >>>> >>>> if (need_toast || newtupsize > pagefree) >>>> { >>>> + /* >>>> + * To prevent data corruption due to updating old tuple by >>>> + * other backends after released buffer >>> >>> That's not really the reason, is it? The prime problem is crash safety / >>> replication. The row-lock we're faking (by setting xmax to our xid), >>> prevents concurrent updates until this backend died. >> >> Fixed. >> >>>> , we need to emit that >>>> + * xmax of old tuple is set and clear visibility map bits if >>>> + * needed before releasing buffer. We can reuse xl_heap_lock >>>> + * for this purpose. It should be fine even if we crash midway >>>> + * from this section and the actual updating one later, since >>>> + * the xmax will appear to come from an aborted xid. >>>> + */ >>>> + START_CRIT_SECTION(); >>>> + >>>> /* Clear obsolete visibility flags ... */ >>>> oldtup.t_data->t_infomask &= ~(HEAP_XMAX_BITS | HEAP_MOVED); >>>> oldtup.t_data->t_infomask2 &= ~HEAP_KEYS_UPDATED; >>>> @@ -3936,6 +3947,46 @@ l2: >>>> /* temporarily make it look not-updated */ >>>> oldtup.t_data->t_ctid = oldtup.t_self; >>>> already_marked = true; >>>> + >>>> + /* Clear PD_ALL_VISIBLE flags */ >>>> + if (PageIsAllVisible(BufferGetPage(buffer))) >>>> + { >>>> + all_visible_cleared = true; >>>> + PageClearAllVisible(BufferGetPage(buffer)); >>>> + visibilitymap_clear(relation, BufferGetBlockNumber(buffer), >>>> + vmbuffer); >>>> + } >>>> + >>>> + MarkBufferDirty(buffer); >>>> + >>>> + if (RelationNeedsWAL(relation)) >>>> + { >>>> + xl_heap_lock xlrec; >>>> + XLogRecPtr recptr; >>>> + >>>> + /* >>>> + * For logical decoding we need combocids to properly decode the >>>> + * catalog. >>>> + */ >>>> + if (RelationIsAccessibleInLogicalDecoding(relation)) >>>> + log_heap_new_cid(relation, &oldtup); >>> >>> Hm, I don't see that being necessary here. Row locks aren't logically >>> decoded, so there's no need to emit this here. >> >> Fixed. >> >>> >>>> + /* Clear PD_ALL_VISIBLE flags */ >>>> + if (PageIsAllVisible(page)) >>>> + { >>>> + Buffer vmbuffer = InvalidBuffer; >>>> + BlockNumber block = BufferGetBlockNumber(*buffer); >>>> + >>>> + all_visible_cleared = true; >>>> + PageClearAllVisible(page); >>>> + visibilitymap_pin(relation, block, &vmbuffer); >>>> + visibilitymap_clear(relation, block, vmbuffer); >>>> + } >>>> + >>> >>> That clears all-visible unnecessarily, we only need to clear all-frozen. >>> >> >> Fixed. 
>> >>> >>>> @@ -8694,6 +8761,23 @@ heap_xlog_lock(XLogReaderState *record) >>>> } >>>> HeapTupleHeaderSetXmax(htup, xlrec->locking_xid); >>>> HeapTupleHeaderSetCmax(htup, FirstCommandId, false); >>>> + >>>> + /* The visibility map need to be cleared */ >>>> + if ((xlrec->infobits_set & XLHL_ALL_VISIBLE_CLEARED) != 0) >>>> + { >>>> + RelFileNode rnode; >>>> + Buffer vmbuffer = InvalidBuffer; >>>> + BlockNumber blkno; >>>> + Relation reln; >>>> + >>>> + XLogRecGetBlockTag(record, 0, &rnode, NULL, &blkno); >>>> + reln = CreateFakeRelcacheEntry(rnode); >>>> + >>>> + visibilitymap_pin(reln, blkno, &vmbuffer); >>>> + visibilitymap_clear(reln, blkno, vmbuffer); >>>> + PageClearAllVisible(page); >>>> + } >>>> + >>> >>> >>>> PageSetLSN(page, lsn); >>>> MarkBufferDirty(buffer); >>>> } >>>> diff --git a/src/include/access/heapam_xlog.h b/src/include/access/heapam_xlog.h >>>> index a822d0b..41b3c54 100644 >>>> --- a/src/include/access/heapam_xlog.h >>>> +++ b/src/include/access/heapam_xlog.h >>>> @@ -242,6 +242,7 @@ typedef struct xl_heap_cleanup_info >>>> #define XLHL_XMAX_EXCL_LOCK 0x04 >>>> #define XLHL_XMAX_KEYSHR_LOCK 0x08 >>>> #define XLHL_KEYS_UPDATED 0x10 >>>> +#define XLHL_ALL_VISIBLE_CLEARED 0x20 >>> >>> Hm. We can't easily do that in the back-patched version; because a >>> standby won't know to check for the flag . That's kinda ok, since we >>> don't yet need to clear all-visible yet at that point of >>> heap_update. But that better means we don't do so on the master either. >>> >> >> Attached latest version patch. > > + /* Clear only the all-frozen bit on visibility map if needed */ > > + if (PageIsAllVisible(BufferGetPage(buffer)) && > > + VM_ALL_FROZEN(relation, block, &vmbuffer)) > + { > + visibilitymap_clear_extended(relation, block, vmbuffer, > + VISIBILITYMAP_ALL_FROZEN); > + } > + > > + if (RelationNeedsWAL(relation)) > + { > .. > > + XLogBeginInsert(); > + XLogRegisterBuffer(0, buffer, REGBUF_STANDARD); > + > + xlrec.offnum = ItemPointerGetOffsetNumber(&oldtup.t_self); > + xlrec.locking_xid = xmax_old_tuple; > + xlrec.infobits_set = compute_infobits(oldtup.t_data->t_infomask, > + oldtup.t_data->t_infomask2); > + XLogRegisterData((char *) &xlrec, SizeOfHeapLock); > + recptr = XLogInsert(RM_HEAP_ID, XLOG_HEAP_LOCK); > .. > > One thing that looks awkward in this code is that it doesn't record > whether the frozen bit is actually cleared during the actual operation > and then during replay, it always clear the frozen bit irrespective of > whether it has been cleared by the actual operation or not. > I changed it so that we look the all-frozen bit up first, and then clear it if needed. > + /* Clear only the all-frozen bit on visibility map if needed */ > + if (PageIsAllVisible(page) && > + VM_ALL_FROZEN(relation, BufferGetBlockNumber(*buffer), &vmbuffer)) > + { > + BlockNumber block = BufferGetBlockNumber(*buffer); > + > + visibilitymap_pin(relation, block, &vmbuffer); > > I think it is not right to call visibilitymap_pin after holding a > buffer lock (visibilitymap_pin can perform I/O). Refer heap_update > for how to pin the visibility map. > Thank you for your advice! Fixed. Attached are the two separated patches; please give me feedback. Regards, -- Masahiko Sawada
Attachment
Hi, So I'm generally happy with 0001, barring some relatively minor adjustments. I am however wondering about one thing: On 2016-07-11 23:51:05 +0900, Masahiko Sawada wrote: > diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c > index 57da57a..e7cb8ca 100644 > --- a/src/backend/access/heap/heapam.c > +++ b/src/backend/access/heap/heapam.c > @@ -3923,6 +3923,16 @@ l2: > > if (need_toast || newtupsize > pagefree) > { > + /* > + * For crash safety, we need to emit that xmax of old tuple is set > + * and clear only the all-frozen bit on visibility map if needed > + * before releasing the buffer. We can reuse xl_heap_lock for this > + * purpose. It should be fine even if we crash midway from this > + * section and the actual updating one later, since the xmax will > + * appear to come from an aborted xid. > + */ > + START_CRIT_SECTION(); > + > /* Clear obsolete visibility flags ... */ > oldtup.t_data->t_infomask &= ~(HEAP_XMAX_BITS | HEAP_MOVED); > oldtup.t_data->t_infomask2 &= ~HEAP_KEYS_UPDATED; > @@ -3936,6 +3946,28 @@ l2: > /* temporarily make it look not-updated */ > oldtup.t_data->t_ctid = oldtup.t_self; > already_marked = true; > + > + MarkBufferDirty(buffer); > + > + if (RelationNeedsWAL(relation)) > + { > + xl_heap_lock xlrec; > + XLogRecPtr recptr; > + > + XLogBeginInsert(); > + XLogRegisterBuffer(0, buffer, REGBUF_STANDARD); > + > + xlrec.offnum = ItemPointerGetOffsetNumber(&oldtup.t_self); > + xlrec.locking_xid = xmax_old_tuple; > + xlrec.infobits_set = compute_infobits(oldtup.t_data->t_infomask, > + oldtup.t_data->t_infomask2); > + XLogRegisterData((char *) &xlrec, SizeOfHeapLock); > + recptr = XLogInsert(RM_HEAP_ID, XLOG_HEAP_LOCK); > + PageSetLSN(page, recptr); > + } Master does /* temporarily make it look not-updated */ oldtup.t_data->t_ctid = oldtup.t_self; here, and as is, the WAL record won't reflect that, because: static void heap_xlog_lock(XLogReaderState *record) { ... /* * Clear relevant update flags, but only if the modified infomask says * there's no update. */ if (HEAP_XMAX_IS_LOCKED_ONLY(htup->t_infomask)) { HeapTupleHeaderClearHotUpdated(htup); /* Make sure there is no forward chain link in t_ctid */ ItemPointerSet(&htup->t_ctid, BufferGetBlockNumber(buffer), offnum); } won't enter the branch, because HEAP_XMAX_LOCK_ONLY won't be set. Which will leave t_ctid and HEAP_HOT_UPDATED set differently on the master and standby / after crash recovery. I'm failing to see any harmful consequences right now, but differences between master and standby are a bad thing. Pre-9.3 that's not a problem; we reset ctid and HOT_UPDATED unconditionally there. I think I'm more comfortable with setting HEAP_XMAX_LOCK_ONLY until the tuple is finally updated - that also coincides more closely with the actual meaning. Any arguments against? > > + /* Clear only the all-frozen bit on visibility map if needed */ > + if (PageIsAllVisible(BufferGetPage(buffer)) && > + VM_ALL_FROZEN(relation, block, &vmbuffer)) > + { > + visibilitymap_clear_extended(relation, block, vmbuffer, > + VISIBILITYMAP_ALL_FROZEN); > + } > + FWIW, I don't think it's worth introducing visibilitymap_clear_extended. As this is a 9.6-only patch, I think it's better to change visibilitymap_clear's API. Unless somebody protests I'm planning to commit with those adjustments tomorrow. Greetings, Andres Freund
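As an illustration of that last proposal (a sketch only, not the committed change), the interim state in heap_update()'s TOAST path would be stamped as a pure row lock rather than a half-done update, so that heap_xlog_lock()'s LOCKED_ONLY branch applies on replay:

    /* Temporarily make the old tuple look merely locked, not updated. */
    oldtup.t_data->t_infomask &= ~(HEAP_XMAX_BITS | HEAP_MOVED);
    oldtup.t_data->t_infomask2 &= ~HEAP_KEYS_UPDATED;
    HeapTupleClearHotUpdated(&oldtup);
    HeapTupleHeaderSetXmax(oldtup.t_data, xmax_old_tuple);
    /* LOCK_ONLY also keeps ctid-chasing loops from following the chain */
    oldtup.t_data->t_infomask |= HEAP_XMAX_LOCK_ONLY | HEAP_XMAX_EXCL_LOCK;
    /* no forward chain link while the tuple is only locked */
    oldtup.t_data->t_ctid = oldtup.t_self;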
On Thu, Jul 14, 2016 at 11:36 AM, Andres Freund <andres@anarazel.de> wrote: > Hi, > > Master does > /* temporarily make it look not-updated */ > oldtup.t_data->t_ctid = oldtup.t_self; > here, and as is the wal record won't reflect that, because: > static void > heap_xlog_lock(XLogReaderState *record) > { > ... > /* > * Clear relevant update flags, but only if the modified infomask says > * there's no update. > */ > if (HEAP_XMAX_IS_LOCKED_ONLY(htup->t_infomask)) > { > HeapTupleHeaderClearHotUpdated(htup); > /* Make sure there is no forward chain link in t_ctid */ > ItemPointerSet(&htup->t_ctid, > BufferGetBlockNumber(buffer), > offnum); > } > won't enter the branch, because HEAP_XMAX_LOCK_ONLY won't be set. Which > will leave t_ctid and HEAP_HOT_UPDATED set differently on the master and > standby / after crash recovery. I'm failing to see any harmful > consequences right now, but differences between master and standby are a bad > thing. Pre 9.3 that's not a problem, we reset ctid and HOT_UPDATED > unconditionally there. I think I'm more comfortable with setting > HEAP_XMAX_LOCK_ONLY until the tuple is finally updated - that also > coincides more closely with the actual meaning. > Just thinking out loud. If we set HEAP_XMAX_LOCK_ONLY during update, then won't it impact the return value of HeapTupleHeaderIsOnlyLocked()? It will start returning true, whereas otherwise I think it would have returned false due to the in-progress transaction. As HeapTupleHeaderIsOnlyLocked() is being used in many places, it might impact those cases. I have not checked in depth whether such an impact would cause any real issue, but it seems to me that some analysis is needed there, unless you think we are safe with respect to that. > Any arguments against? > >> >> + /* Clear only the all-frozen bit on visibility map if needed */ >> + if (PageIsAllVisible(BufferGetPage(buffer)) && >> + VM_ALL_FROZEN(relation, block, &vmbuffer)) >> + { >> + visibilitymap_clear_extended(relation, block, vmbuffer, >> + VISIBILITYMAP_ALL_FROZEN); >> + } >> + > > FWIW, I don't think it's worth introducing visibilitymap_clear_extended. > As this is a 9.6 only patch, i think it's better to change > visibilitymap_clear's API. > > Unless somebody protests I'm planning to commit with those adjustments > tomorrow. > Do you think the performance tests done by Sawada-san are sufficient to proceed here? -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On 2016-07-14 18:12:42 +0530, Amit Kapila wrote: > Just thinking out loud. If we set HEAP_XMAX_LOCK_ONLY during update, > then won't it impact the return value of > HeapTupleHeaderIsOnlyLocked(). It will start returning true whereas > otherwise I think it would have returned false due to in_progress > transaction. As HeapTupleHeaderIsOnlyLocked() is being used at many > places, it might impact those cases, I have not checked in deep > whether such an impact would cause any real issue, but it seems to me > that some analysis is needed there unless you think we are safe with > respect to that. I don't think that's an issue: right now the row will be considered deleted at that moment; with the change, it's considered locked. The latter is surely more appropriate. > > Any arguments against? > > > >> > >> + /* Clear only the all-frozen bit on visibility map if needed */ > >> + if (PageIsAllVisible(BufferGetPage(buffer)) && > >> + VM_ALL_FROZEN(relation, block, &vmbuffer)) > >> + { > >> + visibilitymap_clear_extended(relation, block, vmbuffer, > >> + VISIBILITYMAP_ALL_FROZEN); > >> + } > >> + > > > > FWIW, I don't think it's worth introducing visibilitymap_clear_extended. > > As this is a 9.6 only patch, i think it's better to change > > visibilitymap_clear's API. > > > > Unless somebody protests I'm planning to commit with those adjustments > > tomorrow. > > > > Do you think performance tests done by Sawada-san are sufficient to > proceed here? I'm doing some more, but generally yes. I also don't think we have much of a choice anyway. Greetings, Andres Freund
On 2016-07-13 23:06:07 -0700, Andres Freund wrote: > won't enter the branch, because HEAP_XMAX_LOCK_ONLY won't be set. Which > will leave t_ctid and HEAP_HOT_UPDATED set differently on the master and > standby / after crash recovery. I'm failing to see any harmful > consequences right now, but differences between master and standby are a bad > thing. I think it's actually critical, because HEAP_HOT_UPDATED / HEAP_XMAX_LOCK_ONLY are used to terminate ctid chasing loops (like heap_hot_search_buffer()). Andres
On 2016-07-14 20:53:07 -0700, Andres Freund wrote: > On 2016-07-13 23:06:07 -0700, Andres Freund wrote: > > won't enter the branch, because HEAP_XMAX_LOCK_ONLY won't be set. Which > > will leave t_ctid and HEAP_HOT_UPDATED set differently on the master and > > standby / after crash recovery. I'm failing to see any harmful > > consequences right now, but differences between master and standby are a bad > > thing. > > I think it's actually critical, because HEAP_HOT_UPDATED / > HEAP_XMAX_LOCK_ONLY are used to terminate ctid chasing loops (like > heap_hot_search_buffer()). I've pushed a quite heavily revised version of the first patch to 9.1-master. I manually verified, using pageinspect, gdb breakpoints, and a standby, that xmax, infomask, etc. are set correctly (leading to finding a4d357bf). As there are noticeable differences between versions, especially 9.2->9.3, I'd welcome somebody having a look at the commits. Regards, Andres
On 2016-07-13 23:06:07 -0700, Andres Freund wrote: > > + /* Clear only the all-frozen bit on visibility map if needed */ > > + if (PageIsAllVisible(BufferGetPage(buffer)) && > > + VM_ALL_FROZEN(relation, block, &vmbuffer)) > > + { > > + visibilitymap_clear_extended(relation, block, vmbuffer, > > + VISIBILITYMAP_ALL_FROZEN); > > + } > > + > > FWIW, I don't think it's worth introducing visibilitymap_clear_extended. > As this is a 9.6 only patch, i think it's better to change > visibilitymap_clear's API. Besides that easily fixed issue, the code also has the significant issue that it's only performing the visibilitymap processing in the BLK_NEEDS_REDO case. But that's not ok, because in both the BLK_RESTORED and the BLK_DONE cases the visibilitymap isn't guaranteed (or even likely, in the former case) to have been updated. I think we have two choices for how to deal with that: First, we can add a new flags variable to xl_heap_lock similar to xl_heap_insert/update/... and bump page magic, or we can squeeze the information into infobits_set. The latter seems fairly ugly and fragile to me; so unless somebody protests I'm going with the former. I think, due to padding, the additional byte doesn't make any size difference anyway. Regards, Andres
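A sketch of the first option (field and flag names are assumptions at this point in the thread):

    typedef struct xl_heap_lock
    {
        TransactionId locking_xid;   /* might be a MultiXactId, not an xid */
        OffsetNumber  offnum;        /* locked tuple's offset on page */
        int8          infobits_set;  /* infomask and infomask2 bits to set */
        uint8         flags;         /* new: XLH_LOCK_* flag bits */
    } xl_heap_lock;

    #define XLH_LOCK_ALL_FROZEN_CLEARED 0x01

The redo routine could then fix the visibility map before acting on XLogReadBufferForRedo()'s result, so the BLK_RESTORED and BLK_DONE cases are covered too; for example (assuming the flags-taking visibilitymap_clear() discussed above):

    if (xlrec->flags & XLH_LOCK_ALL_FROZEN_CLEARED)
    {
        RelFileNode rnode;
        Buffer      vmbuffer = InvalidBuffer;
        BlockNumber block;
        Relation    reln;

        XLogRecGetBlockTag(record, 0, &rnode, NULL, &block);
        reln = CreateFakeRelcacheEntry(rnode);

        visibilitymap_pin(reln, block, &vmbuffer);
        visibilitymap_clear(reln, block, vmbuffer, VISIBILITYMAP_ALL_FROZEN);

        ReleaseBuffer(vmbuffer);
        FreeFakeRelcacheEntry(reln);
    }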
On Sat, Jul 16, 2016 at 7:02 AM, Andres Freund <andres@anarazel.de> wrote: > On 2016-07-13 23:06:07 -0700, Andres Freund wrote: >> > + /* Clear only the all-frozen bit on visibility map if needed */ >> > + if (PageIsAllVisible(BufferGetPage(buffer)) && >> > + VM_ALL_FROZEN(relation, block, &vmbuffer)) >> > + { >> > + visibilitymap_clear_extended(relation, block, vmbuffer, >> > + VISIBILITYMAP_ALL_FROZEN); >> > + } >> > + >> >> FWIW, I don't think it's worth introducing visibilitymap_clear_extended. >> As this is a 9.6 only patch, i think it's better to change >> visibilitymap_clear's API. > > Besides that easily fixed issue, the code also has the significant issue > that it's only performing the the visibilitymap processing in the > BLK_NEEDS_REDO case. But that's not ok, because both in the BLK_RESTORED > and the BLK_DONE cases the visibilitymap isn't guaranteed (or even > likely in the former case) to have been updated. > > I think we have two choices how to deal with that: First, we can add a > new flags variable to xl_heap_lock similar to > xl_heap_insert/update/... and bump page magic, > +1 for going this way. This will keep us consistent with how we clear the visibility info in other places like heap_xlog_update(). -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Amit Kapila <amit.kapila16@gmail.com> writes: > On Sat, Jul 16, 2016 at 7:02 AM, Andres Freund <andres@anarazel.de> wrote: >> I think we have two choices how to deal with that: First, we can add a >> new flags variable to xl_heap_lock similar to >> xl_heap_insert/update/... and bump page magic, > +1 for going in this way. This will keep us consistent with how clear > the visibility info in other places like heap_xlog_update(). Yeah. We've already forced a catversion bump for beta3, and I'm about to go fix PG_CONTROL_VERSION as well, so there's basically no downside to doing an xlog version bump as well. At least, not if you can get it in before Monday. regards, tom lane
On July 16, 2016 8:49:06 AM PDT, Tom Lane <tgl@sss.pgh.pa.us> wrote: >Amit Kapila <amit.kapila16@gmail.com> writes: >> On Sat, Jul 16, 2016 at 7:02 AM, Andres Freund <andres@anarazel.de> >wrote: >>> I think we have two choices how to deal with that: First, we can add >a >>> new flags variable to xl_heap_lock similar to >>> xl_heap_insert/update/... and bump page magic, > >> +1 for going in this way. This will keep us consistent with how >clear >> the visibility info in other places like heap_xlog_update(). > >Yeah. We've already forced a catversion bump for beta3, and I'm about >to go fix PG_CONTROL_VERSION as well, so there's basically no downside >to doing an xlog version bump as well. At least, not if you can get it >in before Monday. OK, Cool. Will do it later today. Andres -- Sent from my Android device with K-9 Mail. Please excuse my brevity.
On 2016-07-16 10:45:26 -0700, Andres Freund wrote: > > > On July 16, 2016 8:49:06 AM PDT, Tom Lane <tgl@sss.pgh.pa.us> wrote: > >Amit Kapila <amit.kapila16@gmail.com> writes: > >> On Sat, Jul 16, 2016 at 7:02 AM, Andres Freund <andres@anarazel.de> > >wrote: > >>> I think we have two choices how to deal with that: First, we can add > >a > >>> new flags variable to xl_heap_lock similar to > >>> xl_heap_insert/update/... and bump page magic, > > > >> +1 for going in this way. This will keep us consistent with how > >clear > >> the visibility info in other places like heap_xlog_update(). > > > >Yeah. We've already forced a catversion bump for beta3, and I'm about > >to go fix PG_CONTROL_VERSION as well, so there's basically no downside > >to doing an xlog version bump as well. At least, not if you can get it > >in before Monday. > > OK, Cool. Will do it later today. Took till today. Attached is a rather heavily revised version of Sawada-san's patch. Most notably, the recovery routines take care to reset the vm in all cases; we don't perform visibilitymap_get_status from inside a critical section anymore; and heap_lock_updated_tuple_rec() also resets the vm (although I'm not entirely sure that can practically be hit). I'm doing some more testing, and Robert said he could take a quick look at the patch. If somebody else... Will push sometime after dinner. Regards, Andres
Attachment
On Sun, Jul 17, 2016 at 10:48 PM, Andres Freund <andres@anarazel.de> wrote: > Took till today. Attached is a rather heavily revised version of > Sawada-san's patch. Most notably the recovery routines take care to > reset the vm in all cases, we don't perform visibilitymap_get_status > from inside a critical section anymore, and > heap_lock_updated_tuple_rec() also resets the vm (although I'm not > entirely sure that can practically be hit). > > I'm doing some more testing, and Robert said he could take a quick look > at the patch. If somebody else... Will push sometime after dinner. Thanks very much for working on this. Random suggestions after a quick look: + * Before locking the buffer, pin the visibility map page if it may be + * necessary. s/necessary/needed/ More substantively, what happens if the situation changes before we obtain the buffer lock? I think you need to release the page lock, pin the page after all, and then relock the page. There seem to be several ways to escape from this function without releasing the pin on vmbuffer. From the visibilitymap_pin call here, search downward for "return". + * visibilitymap_clear - clear bit(s) for one page in visibility map I don't really like the parenthesized-s convention as a shorthand for "one or more". It tends to confuse non-native English speakers. + * any I/O. Returns whether any bits have been cleared. I suggest "Returns true if any bits have been cleared and false otherwise". -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Mon, Jul 18, 2016 at 8:18 AM, Andres Freund <andres@anarazel.de> wrote: > On 2016-07-16 10:45:26 -0700, Andres Freund wrote: >> >> >> On July 16, 2016 8:49:06 AM PDT, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> >Amit Kapila <amit.kapila16@gmail.com> writes: >> >> On Sat, Jul 16, 2016 at 7:02 AM, Andres Freund <andres@anarazel.de> >> >wrote: >> >>> I think we have two choices how to deal with that: First, we can add >> >a >> >>> new flags variable to xl_heap_lock similar to >> >>> xl_heap_insert/update/... and bump page magic, >> > >> >> +1 for going in this way. This will keep us consistent with how >> >clear >> >> the visibility info in other places like heap_xlog_update(). >> > >> >Yeah. We've already forced a catversion bump for beta3, and I'm about >> >to go fix PG_CONTROL_VERSION as well, so there's basically no downside >> >to doing an xlog version bump as well. At least, not if you can get it >> >in before Monday. >> >> OK, Cool. Will do it later today. > > Took till today. Attached is a rather heavily revised version of > Sawada-san's patch. Most notably the recovery routines take care to > reset the vm in all cases, we don't perform visibilitymap_get_status > from inside a critical section anymore, and > heap_lock_updated_tuple_rec() also resets the vm (although I'm not > entirely sure that can practically be hit). > @@ -4563,8 +4579,18 @@ heap_lock_tuple(Relation relation, HeapTuple tuple, + /* + * Before locking the buffer, pin the visibility map page if it may be + * necessary. + */ + if (PageIsAllVisible(BufferGetPage(*buffer))) + visibilitymap_pin(relation, block, &vmbuffer); + LockBuffer(*buffer, BUFFER_LOCK_EXCLUSIVE); I think we need to check for PageIsAllVisible and try to pin the visibility map after taking the lock on the buffer. I think it is quite possible that, in the time this routine takes to acquire the lock on the buffer, the page becomes all-visible. To avoid a similar hazard, we check the visibility of the page after acquiring the buffer lock in heap_update(), at the place below. if (vmbuffer == InvalidBuffer && PageIsAllVisible(page)) Similarly, I think heap_lock_updated_tuple_rec() needs to take care of the same. While I was typing this e-mail, it seems Robert has already pointed out the same issue. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On 2016-07-17 23:34:01 -0400, Robert Haas wrote: > Thanks very much for working on this. Random suggestions after a quick look: > > + * Before locking the buffer, pin the visibility map page if it may be > + * necessary. > > s/necessary/needed/ > > More substantively, what happens if the situation changes before we > obtain the buffer lock? I think you need to release the page lock, > pin the page after all, and then relock the page. It shouldn't be able to. Cleanup locks, which are required for vacuumlazy to do anything relevant, aren't possible with the buffer pinned. This pattern is used in heap_delete/heap_update, so I think we're on a reasonably well-trodden path. > There seem to be several ways to escape from this function without > releasing the pin on vmbuffer. From the visibilitymap_pin call here, > search downward for "return". Hm, that's clearly not good. The best thing to address that seems to be to create a separate jump label, which checks vmbuffer and releases the page lock. Unless you have a better idea. > + * visibilitymap_clear - clear bit(s) for one page in visibility map > > I don't really like the parenthesized-s convention as a shorthand for > "one or more". It tends to confuse non-native English speakers. > > + * any I/O. Returns whether any bits have been cleared. > > I suggest "Returns true if any bits have been cleared and false otherwise". Will change. - Andres
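One possible shape for that jump label, sketched with hypothetical label names: route every exit through a common tail that drops the content lock (if still held) and the vm pin.

    /* ...all success/failure paths jump to one of these instead of
     * returning directly... */

    out_locked:
        LockBuffer(*buffer, BUFFER_LOCK_UNLOCK);

    out_unlocked:
        if (BufferIsValid(vmbuffer))
            ReleaseBuffer(vmbuffer);

        return result;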
On 2016-07-18 09:07:19 +0530, Amit Kapila wrote: > + /* > + * Before locking the buffer, pin the visibility map page if it may be > + * necessary. > + */ > > + if (PageIsAllVisible(BufferGetPage(*buffer))) > + visibilitymap_pin(relation, block, &vmbuffer); > + > LockBuffer(*buffer, BUFFER_LOCK_EXCLUSIVE); > > I think we need to check for PageIsAllVisible and try to pin the > visibility map after taking the lock on buffer. I think it is quite > possible that in the time this routine tries to acquire lock on > buffer, the page becomes all visible. I don't see how. Without a cleanup lock it's not possible to mark a page all-visible/frozen. We might miss the bit becoming unset concurrently, but that's ok. Andres
On Mon, Jul 18, 2016 at 9:13 AM, Andres Freund <andres@anarazel.de> wrote: > On 2016-07-18 09:07:19 +0530, Amit Kapila wrote: >> + /* >> + * Before locking the buffer, pin the visibility map page if it may be >> + * necessary. >> + */ >> >> + if (PageIsAllVisible(BufferGetPage(*buffer))) >> + visibilitymap_pin(relation, block, &vmbuffer); >> + >> LockBuffer(*buffer, BUFFER_LOCK_EXCLUSIVE); >> >> I think we need to check for PageIsAllVisible and try to pin the >> visibility map after taking the lock on buffer. I think it is quite >> possible that in the time this routine tries to acquire lock on >> buffer, the page becomes all visible. > > I don't see how. Without a cleanup lock it's not possible to mark a page > all-visible/frozen. > Consider the below scenario. Vacuum a. acquires a cleanup lock on page 10 b. is busy checking the visibility of tuples --assume it takes some time here, and in the meantime Session-1 performs steps (a) and (b) and starts waiting in step (c) c. marks the page as all-visible (PageSetAllVisible) d. unlocks and releases the buffer Session-1 a. In heap_lock_tuple(), reads the buffer for page 10 b. checks PageIsAllVisible(), finds the page is not all-visible, so doesn't acquire the visibilitymap_pin c. LockBuffer in exclusive mode - here it will wait for vacuum to release the lock d. gets the lock, but now the page is marked as all-visible, so ideally it needs to recheck the page and acquire the visibilitymap_pin -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
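If that window is real, the natural defense is the recheck-after-locking dance heap_update() already does; a sketch, reusing heap_lock_tuple()'s variables:

    LockBuffer(*buffer, BUFFER_LOCK_EXCLUSIVE);

    /*
     * The page may have become all-visible while we waited for the lock.
     * If we didn't pin the vm page earlier, do it now -- dropping the
     * content lock first, since visibilitymap_pin() can perform I/O.
     */
    if (vmbuffer == InvalidBuffer && PageIsAllVisible(BufferGetPage(*buffer)))
    {
        LockBuffer(*buffer, BUFFER_LOCK_UNLOCK);
        visibilitymap_pin(relation, block, &vmbuffer);
        LockBuffer(*buffer, BUFFER_LOCK_EXCLUSIVE);
    }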
On 2016-07-18 10:02:52 +0530, Amit Kapila wrote: > On Mon, Jul 18, 2016 at 9:13 AM, Andres Freund <andres@anarazel.de> wrote: > > On 2016-07-18 09:07:19 +0530, Amit Kapila wrote: > >> + /* > >> + * Before locking the buffer, pin the visibility map page if it may be > >> + * necessary. > >> + */ > >> > >> + if (PageIsAllVisible(BufferGetPage(*buffer))) > >> + visibilitymap_pin(relation, block, &vmbuffer); > >> + > >> LockBuffer(*buffer, BUFFER_LOCK_EXCLUSIVE); > >> > >> I think we need to check for PageIsAllVisible and try to pin the > >> visibility map after taking the lock on buffer. I think it is quite > >> possible that in the time this routine tries to acquire lock on > >> buffer, the page becomes all visible. > > > > I don't see how. Without a cleanup lock it's not possible to mark a page > > all-visible/frozen. > > > > Consider the below scenario. > > > > Vacuum > > a. acquires a cleanup lock for page - 10 > > b. busy in checking visibility of tuples > > --assume, here it takes some time and in the meantime Session-1 > > performs step (a) and (b) and start waiting in step- (c) > > c. marks the page as all-visible (PageSetAllVisible) > > d. unlockandrelease the buffer > > > > Session-1 > > a. In heap_lock_tuple(), readbuffer for page-10 > > b. check PageIsAllVisible(), found page is not all-visible, so didn't > > acquire the visbilitymap_pin > > c. LockBuffer in ExlusiveMode - here it will wait for vacuum to > > release the lock > > d. Got the lock, but now the page is marked as all-visible, so ideally > > need to recheck the page and acquire the visibilitymap_pin So, I've tried pretty hard to reproduce that. While the theory above is sound, I believe the relevant code-path is essentially dead for SQL-callable code, because we'll always hold a buffer pin before even entering heap_update/heap_lock_tuple. It's possible that you could concoct a dangerous scenario with follow_updates, though; I can't immediately see how. Due to that, and with the beta release closing in, I'm planning to push a version of the patch that has the returns fixed, but not this. It seems better to have the majority of the fix in. Andres
On 2016-07-18 01:33:10 -0700, Andres Freund wrote: > On 2016-07-18 10:02:52 +0530, Amit Kapila wrote: > > On Mon, Jul 18, 2016 at 9:13 AM, Andres Freund <andres@anarazel.de> wrote: > > > On 2016-07-18 09:07:19 +0530, Amit Kapila wrote: > > >> + /* > > >> + * Before locking the buffer, pin the visibility map page if it may be > > >> + * necessary. > > >> + */ > > >> > > >> + if (PageIsAllVisible(BufferGetPage(*buffer))) > > >> + visibilitymap_pin(relation, block, &vmbuffer); > > >> + > > >> LockBuffer(*buffer, BUFFER_LOCK_EXCLUSIVE); > > >> > > >> I think we need to check for PageIsAllVisible and try to pin the > > >> visibility map after taking the lock on buffer. I think it is quite > > >> possible that in the time this routine tries to acquire lock on > > >> buffer, the page becomes all visible. > > > > > > I don't see how. Without a cleanup lock it's not possible to mark a page > > > all-visible/frozen. > > > > > > > Consider the below scenario. > > > > Vacuum > > a. acquires a cleanup lock for page - 10 > > b. busy in checking visibility of tuples > > --assume, here it takes some time and in the meantime Session-1 > > performs step (a) and (b) and start waiting in step- (c) > > c. marks the page as all-visible (PageSetAllVisible) > > d. unlockandrelease the buffer > > > > Session-1 > > a. In heap_lock_tuple(), readbuffer for page-10 > > b. check PageIsAllVisible(), found page is not all-visible, so didn't > > acquire the visbilitymap_pin > > c. LockBuffer in ExlusiveMode - here it will wait for vacuum to > > release the lock > > d. Got the lock, but now the page is marked as all-visible, so ideally > > need to recheck the page and acquire the visibilitymap_pin > > So, I've tried pretty hard to reproduce that. While the theory above is > sound, I believe the relevant code-path is essentially dead for SQL > callable code, because we'll always hold a buffer pin before even > entering heap_update/heap_lock_tuple. It's possible that you could > concoct a dangerous scenario with follow_updates though; but I can't > immediately see how. Due to that, and based on the closing in beta > release, I'm planning to push a version of the patch that the returns > fixed; but not this. It seems better to have the majority of the fix > in. Pushed that way. Let's try to figure out a good solution for a) testing this case and b) fixing it in a reasonable way. Note that there's also http://archives.postgresql.org/message-id/20160718071729.tlj4upxhaylwv75n%40alap3.anarazel.de which seems related. Regards, Andres
On Sat, Jul 16, 2016 at 10:08 AM, Andres Freund <andres@anarazel.de> wrote: > On 2016-07-14 20:53:07 -0700, Andres Freund wrote: >> On 2016-07-13 23:06:07 -0700, Andres Freund wrote: >> > won't enter the branch, because HEAP_XMAX_LOCK_ONLY won't be set. Which >> > will leave t_ctid and HEAP_HOT_UPDATED set differently on the master and >> > standby / after crash recovery. I'm failing to see any harmful >> > consequences right now, but differences between master and standby are a bad >> > thing. >> >> I think it's actually critical, because HEAP_HOT_UPDATED / >> HEAP_XMAX_LOCK_ONLY are used to terminate ctid chasing loops (like >> heap_hot_search_buffer()). > > I've pushed a quite heavily revised version of the first patch to > 9.1-master. I manually verified using pageinspect, gdb breakpoints and a > standby that xmax, infomask etc are set correctly (leading to finding > a4d357bf). As there's noticeable differences, especially 9.2->9.3, > between versions, I'd welcome somebody having a look at the commits. Waoh, man. Thanks! I was just pinged this weekend about a setup that likely has faced this exact problem, in the shape of "tuple concurrently updated" errors with a node getting kill-9-ed by some framework because it did not finish its shutdown checkpoint in time, in a test which forced it to do crash recovery. I have not been able to put my hands on the raw data to have a look at the flags set within those tuples, but I got the strong feeling that this is related to that. After a couple of rounds of that test, it was possible to see "tuple concurrently updated" errors for a relation that has few pages and a high update rate, using 9.4. More seriously, I have spent some time looking at what you have pushed on each branch, and the fixes are looking correct to me. -- Michael
On Mon, Jul 18, 2016 at 2:03 PM, Andres Freund <andres@anarazel.de> wrote: > On 2016-07-18 10:02:52 +0530, Amit Kapila wrote: >> > >> >> Consider the below scenario. >> >> Vacuum >> a. acquires a cleanup lock for page - 10 >> b. busy in checking visibility of tuples >> --assume, here it takes some time and in the meantime Session-1 >> performs step (a) and (b) and start waiting in step- (c) >> c. marks the page as all-visible (PageSetAllVisible) >> d. unlockandrelease the buffer >> >> Session-1 >> a. In heap_lock_tuple(), readbuffer for page-10 >> b. check PageIsAllVisible(), found page is not all-visible, so didn't >> acquire the visbilitymap_pin >> c. LockBuffer in ExlusiveMode - here it will wait for vacuum to >> release the lock >> d. Got the lock, but now the page is marked as all-visible, so ideally >> need to recheck the page and acquire the visibilitymap_pin > > So, I've tried pretty hard to reproduce that. While the theory above is > sound, I believe the relevant code-path is essentially dead for SQL > callable code, because we'll always hold a buffer pin before even > entering heap_update/heap_lock_tuple. > It is possible that we don't hold any buffer pin before entering heap_update() and/or heap_lock_tuple(). For heap_update(), it is possible when it enters via the simple_heap_update() path. For heap_lock_tuple(), it is possible for an ON CONFLICT DO UPDATE statement and maybe others as well. Let me also try to explain with a test for both cases, if the above is not clear enough. Case-1 for heap_update() ----------------------------------- Session-1 Create table t1(c1 int); Alter table t1 alter column c1 set default 10; --via debugger stop at StoreAttrDefault()/heap_update, while you are in heap_update(), note down the block number Session-2 vacuum (DISABLE_PAGE_SKIPPING) pg_attribute; -- In lazy_scan_heap(), stop at line (if (all_visible && !all_visible_according_to_vm)) for the block number noted in Session-1. Session-1 In the debugger, proceed and let it wait at LockBuffer; note that it will not pin the visibility map. Session-2 Set the visibility flag and complete the operation. Session-1 You will notice that it will attempt to unlock the buffer, pin the visibility map, and lock the buffer again. Case-2 for heap_lock_tuple() ---------------------------------------- Session-1 Create table i_conflict(c1 int, c2 char(100)); Create unique index idx_u on i_conflict(c1); Insert into i_conflict values(1,'aaa'); Insert into i_conflict values(1,'aaa') On Conflict (c1) Do Update Set c2='bbb'; -- via debugger, stop at line 385 in nodeModifyTable.c (In ExecInsert(), at if (onconflict == ONCONFLICT_UPDATE)). Session-2 ------------- vacuum (DISABLE_PAGE_SKIPPING) i_conflict --stop before setting the all-visible flag Session-1 -------------- In the debugger, proceed and let it wait at LockBuffer; note that it will not pin the visibility map. Session-2 --------------- Set the visibility flag and complete the operation. Session-1 -------------- PANIC: wrong buffer passed to visibilitymap_clear --this is problematic. Attached patch fixes the problem for me. Note, I have not tried to reproduce the problem for heap_lock_updated_tuple_rec(), but I think if you are convinced by the above cases, then we should have a similar check in it as well. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On Sat, Jul 23, 2016 at 3:55 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> Attached patch fixes the problem for me. Note, I have not tried to
> reproduce the problem for heap_lock_updated_tuple_rec(), but I think
> if you are convinced by the above cases, then we should have a similar
> check in it as well.

I don't think this hunk is correct:

+ /*
+  * If we didn't pin the visibility map page and the page has become
+  * all visible, we'll have to unlock and re-lock.  See heap_lock_tuple
+  * for details.
+  */
+ if (vmbuffer == InvalidBuffer && PageIsAllVisible(BufferGetPage(buf)))
+ {
+     LockBuffer(buf, BUFFER_LOCK_UNLOCK);
+     visibilitymap_pin(rel, block, &vmbuffer);
+     LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
+     goto l4;
+ }

The code beginning at label l4 assumes that the buffer is unlocked (the first thing it does is take the buffer lock), but this code jumps back to l4 with the buffer still locked. Also, I don't see the point of doing this test so far down in the function. Why not just recheck *immediately* after taking the buffer lock? If you find out that you need the pin after all, then

LockBuffer(buf, BUFFER_LOCK_UNLOCK);
visibilitymap_pin(rel, block, &vmbuffer);
LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);

but *do not* go back to l4. Unless I'm missing something, putting this block further down, as you have it, buys nothing, because none of that intervening code can release the buffer lock without using goto to jump back to l4.

+ /*
+  * If we didn't pin the visibility map page and the page has become all
+  * visible while we were busy locking the buffer, or during some
+  * subsequent window during which we had it unlocked, we'll have to unlock
+  * and re-lock, to avoid holding the buffer lock across an I/O.  That's a
+  * bit unfortunate, especially since we'll now have to recheck whether the
+  * tuple has been locked or updated under us, but hopefully it won't
+  * happen very often.
+  */
+ if (vmbuffer == InvalidBuffer && PageIsAllVisible(page))
+ {
+     LockBuffer(*buffer, BUFFER_LOCK_UNLOCK);
+     visibilitymap_pin(relation, block, &vmbuffer);
+     LockBuffer(*buffer, BUFFER_LOCK_EXCLUSIVE);
+     goto l3;
+ }

In contrast, this looks correct: l3 expects the buffer to be locked already, and the code between l3 and this point can unlock and re-lock the buffer, potentially multiple times, so jumping back to l3 after reacquiring the lock is the right thing to do.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
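For reference, a condensed sketch of the loop shape being discussed, heavily simplified from heap_lock_updated_tuple_rec() (the real code uses heap_fetch() and carries much more state; variable handling here is abbreviated for illustration). It shows why the recheck belongs immediately after the LockBuffer() at l4 and needs no goto:

#include "postgres.h"
#include "miscadmin.h"
#include "access/visibilitymap.h"
#include "storage/bufmgr.h"
#include "storage/itemptr.h"
#include "utils/rel.h"

static void
lock_updated_tuple_sketch(Relation rel, ItemPointerData tupid)
{
    Buffer      buf;
    Buffer      vmbuffer = InvalidBuffer;
    BlockNumber block;

    for (;;)
    {
        block = ItemPointerGetBlockNumber(&tupid);
        buf = ReadBuffer(rel, block);   /* pinned, but not yet locked */

l4:     /* l4 is entered with the buffer unlocked: it locks it itself */
        CHECK_FOR_INTERRUPTS();
        LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);

        /*
         * Recheck here, immediately after taking the lock.  If the VM
         * pin turns out to be needed, drop the lock, pin, re-lock, and
         * simply fall through: nothing between this point and the
         * unlock at the bottom of the loop releases the buffer lock,
         * so no goto is required -- and jumping back to l4 would try
         * to lock the already-locked buffer.
         */
        if (vmbuffer == InvalidBuffer &&
            PageIsAllVisible(BufferGetPage(buf)))
        {
            LockBuffer(buf, BUFFER_LOCK_UNLOCK);
            visibilitymap_pin(rel, block, &vmbuffer);
            LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
        }

        /* ... examine the tuple, adjust xmax/infomask, follow t_ctid ... */

        UnlockReleaseBuffer(buf);
        break;  /* sketch only; the real code loops along the ctid chain */
    }
}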
On Wed, Jul 27, 2016 at 3:24 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Sat, Jul 23, 2016 at 3:55 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> Attached patch fixes the problem for me. Note, I have not tried to
>> reproduce the problem for heap_lock_updated_tuple_rec(), but I think
>> if you are convinced by the above cases, then we should have a similar
>> check in it as well.
>
> I don't think this hunk is correct:
>
> + /*
> +  * If we didn't pin the visibility map page and the page has become
> +  * all visible, we'll have to unlock and re-lock.  See heap_lock_tuple
> +  * for details.
> +  */
> + if (vmbuffer == InvalidBuffer && PageIsAllVisible(BufferGetPage(buf)))
> + {
> +     LockBuffer(buf, BUFFER_LOCK_UNLOCK);
> +     visibilitymap_pin(rel, block, &vmbuffer);
> +     LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
> +     goto l4;
> + }
>
> The code beginning at label l4 assumes that the buffer is unlocked (the
> first thing it does is take the buffer lock), but this code jumps back to
> l4 with the buffer still locked. Also, I don't see the point of doing
> this test so far down in the function. Why not just recheck
> *immediately* after taking the buffer lock?

Right, in this case we can recheck immediately after taking the buffer lock; updated patch attached.

In passing, I have noticed that heap_delete() doesn't do this unlocking, VM pinning, and re-locking at the appropriate place: it only checks immediately after taking the lock, whereas the code further down unlocks and re-locks the buffer again without rechecking. I think we should fix that as in the attached patch (pin_vm_heap_delete-v1.patch).

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
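Stated as an invariant, what all of these call sites (heap_update(), heap_delete(), heap_lock_tuple(), heap_lock_updated_tuple_rec()) must guarantee before going on to clear the all-visible bit is roughly the following -- a hypothetical assertion for illustration, not code from either attached patch:

/*
 * Hypothetical invariant check: whenever the heap page is marked
 * all-visible at the time we are about to modify it, we must already
 * hold a pin on the corresponding VM buffer; otherwise
 * visibilitymap_clear() fails with "wrong buffer passed to
 * visibilitymap_clear".
 */
Assert(!PageIsAllVisible(page) || BufferIsValid(vmbuffer));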
On Sat, Jul 23, 2016 at 01:25:55PM +0530, Amit Kapila wrote: > On Mon, Jul 18, 2016 at 2:03 PM, Andres Freund <andres@anarazel.de> wrote: > > On 2016-07-18 10:02:52 +0530, Amit Kapila wrote: > >> Consider the below scenario. > >> > >> Vacuum > >> a. acquires a cleanup lock for page - 10 > >> b. busy in checking visibility of tuples > >> --assume, here it takes some time and in the meantime Session-1 > >> performs step (a) and (b) and start waiting in step- (c) > >> c. marks the page as all-visible (PageSetAllVisible) > >> d. unlockandrelease the buffer > >> > >> Session-1 > >> a. In heap_lock_tuple(), readbuffer for page-10 > >> b. check PageIsAllVisible(), found page is not all-visible, so didn't > >> acquire the visbilitymap_pin > >> c. LockBuffer in ExlusiveMode - here it will wait for vacuum to > >> release the lock > >> d. Got the lock, but now the page is marked as all-visible, so ideally > >> need to recheck the page and acquire the visibilitymap_pin > > > > So, I've tried pretty hard to reproduce that. While the theory above is > > sound, I believe the relevant code-path is essentially dead for SQL > > callable code, because we'll always hold a buffer pin before even > > entering heap_update/heap_lock_tuple. > > > > It is possible that we don't hold any buffer pin before entering > heap_update() and or heap_lock_tuple(). For heap_update(), it is > possible when it enters via simple_heap_update() path. For > heap_lock_tuple(), it is possible for ON CONFLICT DO Update statement > and may be others as well. This is currently listed as a 9.6 open item. Is it indeed a regression in 9.6, or do released versions have the same defect? If it is a 9.6 regression, do you happen to know which commit, or at least which feature, caused it? Thanks, nm
On Tue, Aug 2, 2016 at 11:19 AM, Noah Misch <noah@leadboat.com> wrote: > On Sat, Jul 23, 2016 at 01:25:55PM +0530, Amit Kapila wrote: >> On Mon, Jul 18, 2016 at 2:03 PM, Andres Freund <andres@anarazel.de> wrote: >> > On 2016-07-18 10:02:52 +0530, Amit Kapila wrote: >> >> Consider the below scenario. >> >> >> >> Vacuum >> >> a. acquires a cleanup lock for page - 10 >> >> b. busy in checking visibility of tuples >> >> --assume, here it takes some time and in the meantime Session-1 >> >> performs step (a) and (b) and start waiting in step- (c) >> >> c. marks the page as all-visible (PageSetAllVisible) >> >> d. unlockandrelease the buffer >> >> >> >> Session-1 >> >> a. In heap_lock_tuple(), readbuffer for page-10 >> >> b. check PageIsAllVisible(), found page is not all-visible, so didn't >> >> acquire the visbilitymap_pin >> >> c. LockBuffer in ExlusiveMode - here it will wait for vacuum to >> >> release the lock >> >> d. Got the lock, but now the page is marked as all-visible, so ideally >> >> need to recheck the page and acquire the visibilitymap_pin >> > >> > So, I've tried pretty hard to reproduce that. While the theory above is >> > sound, I believe the relevant code-path is essentially dead for SQL >> > callable code, because we'll always hold a buffer pin before even >> > entering heap_update/heap_lock_tuple. >> > >> >> It is possible that we don't hold any buffer pin before entering >> heap_update() and or heap_lock_tuple(). For heap_update(), it is >> possible when it enters via simple_heap_update() path. For >> heap_lock_tuple(), it is possible for ON CONFLICT DO Update statement >> and may be others as well. > > This is currently listed as a 9.6 open item. Is it indeed a regression in > 9.6, or do released versions have the same defect? If it is a 9.6 regression, > do you happen to know which commit, or at least which feature, caused it? > Commit eca0f1db is the reason for this specific issue. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On Tue, Aug 02, 2016 at 02:10:29PM +0530, Amit Kapila wrote: > On Tue, Aug 2, 2016 at 11:19 AM, Noah Misch <noah@leadboat.com> wrote: > > On Sat, Jul 23, 2016 at 01:25:55PM +0530, Amit Kapila wrote: > >> On Mon, Jul 18, 2016 at 2:03 PM, Andres Freund <andres@anarazel.de> wrote: > >> > On 2016-07-18 10:02:52 +0530, Amit Kapila wrote: > >> >> Consider the below scenario. > >> >> > >> >> Vacuum > >> >> a. acquires a cleanup lock for page - 10 > >> >> b. busy in checking visibility of tuples > >> >> --assume, here it takes some time and in the meantime Session-1 > >> >> performs step (a) and (b) and start waiting in step- (c) > >> >> c. marks the page as all-visible (PageSetAllVisible) > >> >> d. unlockandrelease the buffer > >> >> > >> >> Session-1 > >> >> a. In heap_lock_tuple(), readbuffer for page-10 > >> >> b. check PageIsAllVisible(), found page is not all-visible, so didn't > >> >> acquire the visbilitymap_pin > >> >> c. LockBuffer in ExlusiveMode - here it will wait for vacuum to > >> >> release the lock > >> >> d. Got the lock, but now the page is marked as all-visible, so ideally > >> >> need to recheck the page and acquire the visibilitymap_pin > >> > > >> > So, I've tried pretty hard to reproduce that. While the theory above is > >> > sound, I believe the relevant code-path is essentially dead for SQL > >> > callable code, because we'll always hold a buffer pin before even > >> > entering heap_update/heap_lock_tuple. > >> > > >> > >> It is possible that we don't hold any buffer pin before entering > >> heap_update() and or heap_lock_tuple(). For heap_update(), it is > >> possible when it enters via simple_heap_update() path. For > >> heap_lock_tuple(), it is possible for ON CONFLICT DO Update statement > >> and may be others as well. > > > > This is currently listed as a 9.6 open item. Is it indeed a regression in > > 9.6, or do released versions have the same defect? If it is a 9.6 regression, > > do you happen to know which commit, or at least which feature, caused it? > > > > Commit eca0f1db is the reason for this specific issue. [Action required within 72 hours. This is a generic notification.] The above-described topic is currently a PostgreSQL 9.6 open item. Andres, since you committed the patch believed to have created it, you own this open item. If some other commit is more relevant or if this does not belong as a 9.6 open item, please let us know. Otherwise, please observe the policy on open item ownership[1] and send a status update within 72 hours of this message. Include a date for your subsequent status update. Testers may discover new open items at any time, and I want to plan to get them all fixed in advance of shipping 9.6rc1 next week. Consequently, I will appreciate your efforts toward speedy resolution. Thanks. [1] http://www.postgresql.org/message-id/20160527025039.GA447393@tornado.leadboat.com
Hi,

On 2016-08-02 10:55:18 -0400, Noah Misch wrote:
> [Action required within 72 hours. This is a generic notification.]
>
> The above-described topic is currently a PostgreSQL 9.6 open item. Andres,
> since you committed the patch believed to have created it, you own this open
> item.

Well, kinda (it was a partial fix for something not originally by me), but I'll deal with it. Reading now, will commit tomorrow.

Regards,

Andres
On Thu, Aug 4, 2016 at 3:24 AM, Andres Freund <andres@anarazel.de> wrote: > Hi, > > On 2016-08-02 10:55:18 -0400, Noah Misch wrote: >> [Action required within 72 hours. This is a generic notification.] >> >> The above-described topic is currently a PostgreSQL 9.6 open item. Andres, >> since you committed the patch believed to have created it, you own this open >> item. > > Well kinda (it was a partial fix for something not originally by me), > but I'll deal with. Reading now, will commit tomorrow. Thanks. I kept meaning to get to this one, and failing to do so. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company