Thread: crash-safe visibility map, take five

crash-safe visibility map, take five

From
Robert Haas
Date:
On Wed, Mar 30, 2011 at 8:52 AM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:
>> Another question:
>> To address the problem in
>> http://archives.postgresql.org/pgsql-hackers/2010-02/msg02097.php
>> , should we just clear the vm before the log of insert/update/delete?
>> This may reduce the performance, is there another solution?
>
> Yeah, that's a straightforward way to fix it. I don't think the performance
> hit will be too bad. But we need to be careful not to hold locks while doing
> I/O, which might require some rearrangement of the code. We might want to do
> a similar dance that we do in vacuum, and call visibilitymap_pin first, then
> lock and update the heap page, and then set the VM bit while holding the
> lock on the heap page.

Here's an attempt at implementing the necessary gymnastics.

Comments?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachment

Re: crash-safe visibility map, take five

From
Rob Wultsch
Date:
On Fri, May 6, 2011 at 2:47 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> Comments?

At my day job there is a saying: "Silence is consent".

I am surprised there has not been more discussion of this change,
considering the magnitude of the possibilities it unlocks.


-- 
Rob Wultsch
wultsch@gmail.com


Re: crash-safe visibility map, take five

From
Merlin Moncure
Date:
On Fri, May 6, 2011 at 5:47 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Wed, Mar 30, 2011 at 8:52 AM, Heikki Linnakangas
> <heikki.linnakangas@enterprisedb.com> wrote:
>>> Another question:
>>> To address the problem in
>>> http://archives.postgresql.org/pgsql-hackers/2010-02/msg02097.php
>>> , should we just clear the vm before the log of insert/update/delete?
>>> This may reduce the performance, is there another solution?
>>
>> Yeah, that's a straightforward way to fix it. I don't think the performance
>> hit will be too bad. But we need to be careful not to hold locks while doing
>> I/O, which might require some rearrangement of the code. We might want to do
>> a similar dance that we do in vacuum, and call visibilitymap_pin first, then
>> lock and update the heap page, and then set the VM bit while holding the
>> lock on the heap page.
>
> Here's an attempt at implementing the necessary gymnastics.

Is there a quick synopsis of why you have to do (sometimes) the
pin->lock->unlock->pin->lock mechanic? How come you only can fail to
get the pin at most once?

merlin


Re: crash-safe visibility map, take five

From
Robert Haas
Date:
On Mon, May 9, 2011 at 10:25 PM, Merlin Moncure <mmoncure@gmail.com> wrote:
> On Fri, May 6, 2011 at 5:47 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> On Wed, Mar 30, 2011 at 8:52 AM, Heikki Linnakangas
>> <heikki.linnakangas@enterprisedb.com> wrote:
>>>> Another question:
>>>> To address the problem in
>>>> http://archives.postgresql.org/pgsql-hackers/2010-02/msg02097.php
>>>> , should we just clear the vm before the log of insert/update/delete?
>>>> This may reduce the performance, is there another solution?
>>>
>>> Yeah, that's a straightforward way to fix it. I don't think the performance
>>> hit will be too bad. But we need to be careful not to hold locks while doing
>>> I/O, which might require some rearrangement of the code. We might want to do
>>> a similar dance that we do in vacuum, and call visibilitymap_pin first, then
>>> lock and update the heap page, and then set the VM bit while holding the
>>> lock on the heap page.
>>
>> Here's an attempt at implementing the necessary gymnastics.
>
> Is there a quick synopsis of why you have to do (sometimes) the
> pin->lock->unlock->pin->lock mechanic? How come you only can fail to
> get the pin at most once?

I thought I'd explained it fairly thoroughly in the comments, but
evidently not.  Suggestions for improvement are welcome.

Here goes in more detail: Every time we insert, update, or delete a
tuple in a particular heap page, we must check whether the page is
marked all-visible.  If it is, then we need to clear the page-level
bit marking it as all-visible, and also the corresponding page in the
visibility map.  On the other hand, if the page isn't marked
all-visible, then we needn't touch the visibility map at all.  So,
there are either one or two buffers involved: the buffer containing
the heap page ("buffer") and possibly also a buffer containing the
visibility map page in which the bit for the heap page is to be found
("vmbuffer").   Before taking an exclusive content-lock on the heap
buffer, we check whether the page appears to be all-visible.  If it
does, then we pin the visibility map page and then lock the buffer.
If not, we just lock the buffer.  However, since we weren't holding
any lock, it's possible that between the time when we checked the
visibility map bit and the time when we obtained the exclusive
buffer-lock, the visibility map bit might have changed from clear to
set (because someone is concurrently running VACUUM on the table; or
on platforms with weak memory-ordering, someone was running VACUUM
"almost" concurrently).  If that happens, we give up our buffer lock,
go pin the visibility map page, and reacquire the buffer lock.

At this point in the process, we know that *if* the page is marked
all-visible, *then* we have the appropriate visibility map page
pinned.  There are three possible pathways: (1) If the buffer
initially appeared to be all-visible, we will have pinned the
visibility map page before acquiring the exclusive lock; (2) If the
buffer initially appeared NOT to be all-visible, but by the time we
obtained the exclusive lock it now appeared to be all-visible, then we
will have done the unfortunate unlock-pin-relock dance, and the
visibility map page will now be pinned; (3) if the buffer initially
appeared NOT to be all-visible, and by the time we obtained the
exclusive lock it STILL appeared NOT to be all-visible, then we don't
have the visibility map page pinned - but that's OK, because in this
case no operation on the visibility map needs to be performed.
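[The dance described above can be condensed into a rough sketch -- my paraphrase, not the actual patch code; the function names are the PostgreSQL buffer-manager and visibility map APIs discussed in this thread:

```c
buffer = ReadBuffer(relation, targetBlock);

/* (1) Heap page looks all-visible: pin the vm page (possible I/O)
 * before taking the content lock. */
if (PageIsAllVisible(BufferGetPage(buffer)))
    visibilitymap_pin(relation, targetBlock, &vmbuffer);

LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);

/* (2) Recheck under the lock: the bit may have been set concurrently
 * (e.g. by VACUUM) after we looked.  If so, drop the lock, pin the
 * vm page (again, possible I/O), and relock -- the unfortunate dance. */
if (PageIsAllVisible(BufferGetPage(buffer)) &&
    !visibilitymap_pin_ok(targetBlock, vmbuffer))
{
    LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
    visibilitymap_pin(relation, targetBlock, &vmbuffer);
    LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
}

/* (3) Still not all-visible under the lock: no vm bit to clear,
 * so no vm pin is needed. */
```

The key invariant is that by the time we hold the exclusive lock, an all-visible page implies a pinned vm page, so no I/O ever happens under the lock.]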

Now it is very possible that in case (1) or (2) the visibility map
bit, though we saw it set at some point, will actually have been
cleared in the meantime.  In case (1), this could happen before we
obtain the exclusive lock; while in case (2), it could happen after we
give up the lock to go pin the visibility map page and before we
reacquire it.  This will typically happen when a buffer has been
sitting around for a while in an all-visible state and suddenly two
different backends both try to update or delete tuples in that buffer
at almost exactly the same time.  But it causes no great harm - both
backends will pin the visibility map page, whichever one gets the
exclusive lock on the heap page first will clear it, and when the
other backend gets the heap page afterwards, it will see that the bit
has already been cleared and do nothing further.  We've wasted the
effort of pinning and unpinning the visibility map page when it wasn't
really necessary, but that's not the end of the world.

We could avoid all of this complexity - and the possibility of pinning
the visibility map page needlessly - by locking the heap buffer first
and then pinning the visibility map page if the heap page is
all-visible.  However, that would involve holding the lock on the heap
buffer across a possible disk I/O to bring the visibility map page
into memory, which is something the existing code tries pretty hard to
avoid.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: crash-safe visibility map, take five

From
Merlin Moncure
Date:
On Tue, May 10, 2011 at 7:48 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> I thought I'd explained it fairly thoroughly in the comments, but
> evidently not.  Suggestions for improvement are welcome.

ok.  that clears it up nicely.

> Here goes in more detail: Every time we insert, update, or delete a
> tuple in a particular heap page, we must check whether the page is
> marked all-visible.  If it is, then we need to clear the page-level
> bit marking it as all-visible, and also the corresponding page in the
> visibility map.  On the other hand, if the page isn't marked
> all-visible, then we needn't touch the visibility map at all.  So,
> there are either one or two buffers involved: the buffer containing
> the heap page ("buffer") and possibly also a buffer containing the
> visibility map page in which the bit for the heap page is to be found
> ("vmbuffer").   Before taking an exclusive content-lock on the heap
> buffer, we check whether the page appears to be all-visible.  If it
> does, then we pin the visibility map page and then lock the buffer.
> If not, we just lock the buffer.

I see: here's a comment that was throwing me off:
+    /*
+     * If we didn't get the lock and it turns out we need it, we'll have to
+     * unlock and re-lock, to avoid holding the buffer lock across an I/O.
+     * That's a bit unfortunate, but hopefully shouldn't happen often.
+     */

I think that might be phrased as "didn't get the pin and it turns out
we need it because the bit can change after inspection".  The visible
bit isn't 'wrong' as suggested in the comments, it just can change so
that it becomes wrong.  Maybe a note of why it could change would be
helpful.

Other than that, it looks pretty good...ISTM an awfully small amount
of code to provide what it's doing (that's a good thing!).

merlin


Re: crash-safe visibility map, take five

From
Robert Haas
Date:
On Tue, May 10, 2011 at 9:45 AM, Merlin Moncure <mmoncure@gmail.com> wrote:
> I see: here's a comment that was throwing me off:
> +       /*
> +        * If we didn't get the lock and it turns out we need it, we'll have to
> +        * unlock and re-lock, to avoid holding the buffer lock across an I/O.
> +        * That's a bit unfortunate, but hopefully shouldn't happen often.
> +        */
>
> I think that might be phrased as "didn't get the pin and it turns out
> we need it because the bit can change after inspection".  The visible
> bit isn't 'wrong' as suggested in the comments, it just can change so
> that it becomes wrong.  Maybe a note of why it could change would be
> helpful.

Oh, I see.  I did write "lock" when I meant "pin", and your other
point is well-taken as well.  Here's a revised version with some
additional wordsmithing.

> Other than that, it looks pretty good...ISTM an awfully small amount
> of code to provide what it's doing (that's a good thing!).

Thanks.  It's definitely not big in terms of code footprint; it's
mostly a matter of making sure we've dotted all the "i"s and crossed
all the "t"s.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachment

Re: crash-safe visibility map, take five

From
Jesper Krogh
Date:
On 2011-05-10 14:48, Robert Haas wrote:
> We could avoid all of this complexity - and the possibility of pinning
> the visibility map page needlessly - by locking the heap buffer first
> and then pinning the visibility map page if the heap page is
> all-visible.  However, that would involve holding the lock on the heap
> buffer across a possible disk I/O to bring the visibility map page
> into memory, which is something the existing code tries pretty hard to
> avoid.
Assuming that the visibility map would be used for visibility testing,
just taking the lock would effectively mean "we want it in the buffers",
which would not be that bad?

Or what is the downside for keeping it across IO? Will it block other
processes trying to read it?

-- 
Jesper


Re: crash-safe visibility map, take five

From
Robert Haas
Date:
On Tue, May 10, 2011 at 11:34 AM, Jesper Krogh <jesper@krogh.cc> wrote:
> On 2011-05-10 14:48, Robert Haas wrote:
>>
>> We could avoid all of this complexity - and the possibility of pinning
>> the visibility map page needlessly - by locking the heap buffer first
>> and then pinning the visibility map page if the heap page is
>> all-visible.  However, that would involve holding the lock on the heap
>> buffer across a possible disk I/O to bring the visibility map page
>> into memory, which is something the existing code tries pretty hard to
>> avoid.
>
> Assuming that the visibility map would be used for visibility testing,
> just taking the lock would effectively mean "we want it in the buffers",
> which would not be that bad?
>
> Or what is the downside for keeping it across IO? Will it block other
> processes trying to read it?

Heikki might be in a better position to comment on that than I am,
since he wrote the existing code.  But I think that's basically the
issue.  When one process has an exclusive buffer lock, nobody else can
scan the tuples in that block - so a sequential scan, for example,
that reached that block, or an index scan that needed to probe into
it, would pile up behind the read of the visibility map page.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: crash-safe visibility map, take five

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> On Tue, May 10, 2011 at 11:34 AM, Jesper Krogh <jesper@krogh.cc> wrote:
>> Or what is the downside for keeping it across IO? Will it block other
>> processes trying to read it?

> Heikki might be in a better position to comment on that than I am,
> since he wrote the existing code.  But I think that's basically the
> issue.  When one process has an exclusive buffer lock, nobody else can
> scan the tuples in that block - so a sequential scan, for example,
> that reached that block, or an index scan that needed to probe into
> it, would pile up behind the read of the visibility map page.

Right, it's the loss of potential concurrency that's annoying here.

On the other hand, the concurrency loss might be entirely theoretical
--- in particular, if other potential readers of the heap page would
probably also need to wait for the visibility page to come in, then
nothing is gained by letting them acquire the heap page lock sooner.

I've not read this thread in any detail yet, but if we're going to be
jumping through extremely complex hoops to avoid that scenario, it might
be better to KISS ... especially in the first iteration.
        regards, tom lane


Re: crash-safe visibility map, take five

From
Robert Haas
Date:
On Tue, May 10, 2011 at 1:00 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> On Tue, May 10, 2011 at 11:34 AM, Jesper Krogh <jesper@krogh.cc> wrote:
>>> Or what is the downside for keeping it across IO? Will it block other
>>> processes trying to read it?
>
>> Heikki might be in a better position to comment on that than I am,
>> since he wrote the existing code.  But I think that's basically the
>> issue.  When one process has an exclusive buffer lock, nobody else can
>> scan the tuples in that block - so a sequential scan, for example,
>> that reached that block, or an index scan that needed to probe into
>> it, would pile up behind the read of the visibility map page.
>
> Right, it's the loss of potential concurrency that's annoying here.
>
> On the other hand, the concurrency loss might be entirely theoretical
> --- in particular, if other potential readers of the heap page would
> probably also need to wait for the visibility page to come in, then
> nothing is gained by letting them acquire the heap page lock sooner.
>
> I've not read this thread in any detail yet, but if we're going to be
> jumping through extremely complex hoops to avoid that scenario, it might
> be better to KISS ... especially in the first iteration.

I wouldn't describe the hoops as extremely complex; I don't feel any
inclination to simplify the patch beyond what it is right now.  Of
course, we'll see what the feedback looks like after more people have
read the patch, but my feeling is that the patch strikes a reasonable
balance between performance and keeping it simple.  There are some
more complicated shenanigans that I started to experiment with and
ripped out as premature optimization, but this part I think is OK.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: crash-safe visibility map, take five

From
Pavan Deolasee
Date:


On Tue, May 10, 2011 at 7:38 PM, Robert Haas <robertmhaas@gmail.com> wrote:
On Tue, May 10, 2011 at 9:45 AM, Merlin Moncure <mmoncure@gmail.com> wrote:
> I see: here's a comment that was throwing me off:
> +       /*
> +        * If we didn't get the lock and it turns out we need it, we'll have to
> +        * unlock and re-lock, to avoid holding the buffer lock across an I/O.
> +        * That's a bit unfortunate, but hopefully shouldn't happen often.
> +        */
>
> I think that might be phrased as "didn't get the pin and it turns out
> we need it because the bit can change after inspection".  The visible
> bit isn't 'wrong' as suggested in the comments, it just can change so
> that it becomes wrong.  Maybe a note of why it could change would be
> helpful.

Oh, I see.  I did write "lock" when I meant "pin", and your other
point is well-taken as well.  Here's a revised version with some
additional wordsmithing.


Some trivial comments.

Why do the set and clear functions need a pass-by-reference (Buffer *) argument? I don't see them modifying the argument at all. This patch adds the clear function, but the existing set function also suffers from that.

There are several invocations of pin/clear/release combos. Maybe you would want a convenience routine for doing that in a single step, or just pass InvalidBuffer to clear(), in which case it would assume that the vm buffer is not pinned and do the needful.

The comment at the top of visibilitymap_pin_ok says "On entry, *buf", but the function really takes just a buf. You could possibly fold visibilitymap_pin_ok() into a macro (and also name it slightly differently, like visibilitymap_is_pinned?).

Thanks,
Pavan

--
Pavan Deolasee
EnterpriseDB     http://www.enterprisedb.com

Re: crash-safe visibility map, take five

From
Robert Haas
Date:
On Wed, May 11, 2011 at 8:46 AM, Pavan Deolasee
<pavan.deolasee@gmail.com> wrote:
> Why do the set and clear functions need pass-by-reference (Buffer *)
> argument ? I don't see them modifying the argument at all. This patch adds
> the clear function, but the existing set function also suffers from that.

Yep, I just copied the style of the existing function, in the interest
of changing no more than necessary.  But maybe it'd be better to
change them both to take just the Buffer, instead of a pointer.
Anyone else want to weigh in?

> There are several invocations of pin/clear/release combos. May be you would
> want a convenience routine for doing that in a single step or just pass
> InvalidBuffer to clear() in which case, it would assume that the vm buffer
> is not pinned and do the needful.

I don't think there are enough copies of this logic to worry about it;
and I think that keeping those routines separate is more clear.

> The comment at the top of visibilitymap_pin_ok  says "On entry, *buf", but
> the function really takes just a buf.

Good catch.  Very slightly updated patch attached.

> You can possibly fold
> visibilitymap_pin_ok() into a macro (and also name it slightly differently
> like visibilitymap_is_pinned ?).

Well, that name would be kind of misleading, because it's not so much
checking whether we have a pin, but rather whether any pin we have is
the right one.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachment

Re: crash-safe visibility map, take five

From
Noah Misch
Date:
Robert,

I took a look at this patch.  To summarize its purpose as I understand it, it
does two independent but synergistic things:

1. Move the INSERT/UPDATE/DELETE clearing of visibility map bits from after
the main operation to before it.  This has no effect on crash recovery.  It
closes a race condition described in the last paragraphs of visibilitymap.c's
header comment.

2. In the words of a comment added by the patch:

    * The critical integrity requirement here is that we must never end up with
    * a situation where the visibility map bit is set, and the page-level
    * PD_ALL_VISIBLE bit is clear.  If that were to occur, then a subsequent
    * page modification would fail to clear the visibility map bit.
It does this by WAL-logging the operation of setting a vm bit.  This also has
the benefit of getting vm bits set correctly on standbys.

On Mon, May 23, 2011 at 03:54:44PM -0400, Robert Haas wrote:
> On Wed, May 11, 2011 at 8:46 AM, Pavan Deolasee
> <pavan.deolasee@gmail.com> wrote:
> > Why do the set and clear functions need pass-by-reference (Buffer *)
> > argument ? I don't see them modifying the argument at all. This patch adds
> > the clear function, but the existing set function also suffers from that.
> 
> Yep, I just copied the style of the existing function, in the interest
> of changing no more than necessary.  But maybe it'd be better to
> change them both to take just the Buffer, instead of a pointer.
> Anyone else want to weigh in?

+1 for Buffer over Buffer * when the value isn't mutated for the caller.


I suggest revisiting the suggestion in
http://archives.postgresql.org/message-id/27743.1291135210@sss.pgh.pa.us and
including a latestRemovedXid in xl_heap_visible.  The range of risky xids is
the same for setting a visibility map bit as for deleting a dead tuple, and
the same operation (VACUUM) produces both conflicts.  The system that has
fueled my reports related to standby conflict would hit a snapshot conflict
more or less instantly without vacuum_defer_cleanup_age.  (With 9.1, we'll use
hot_standby_feedback and never look back.)  Adding visibility map bits as a
source of conflict costs nothing to a system like that.  You might not like it
on an INSERT-only system whose VACUUMs only update visibility.  Such systems
might like a GUC on the standby that disables both index-only scans and
conflict on xl_heap_visible.latestRemovedXid.  For normal read/write systems,
there will be no advantage.

The patch follows a design hashed in Nov-Dec 2010.  Currently, it adds
overhead to VACUUM with thin practical benefit.  However, there's broad
community acceptance that this is a step on a deterministic path to success at
implementing index-only scans, for worthy benefit.


lazy_scan_heap() has two calls to visibilitymap_set(), but the patch only
changed the recptr argument for one of them.  This has the effect that we only
emit WAL for empty pages and pages that happened to have pd_lsn == {0,0}, such
as those not modified since the transaction creating the table.  I fixed this
before testing further.

Next, for no especially good reason, I set a breakpoint on the closing brace
of visibilitymap_set, ran a VACUUM that called it, and instigated a PANIC from
gdb.  Recovery failed like this:

3614 2011-06-16 20:05:18.118 EDT LOG:  00000: redo starts at 0/172ABA0
3614 2011-06-16 20:05:18.118 EDT LOCATION:  StartupXLOG, xlog.c:6494
3614 2011-06-16 20:05:18.119 EDT FATAL:  XX000: lock 148 is not held
3614 2011-06-16 20:05:18.119 EDT CONTEXT:  xlog redo visible: rel 1663/16384/24588; blk 0
3614 2011-06-16 20:05:18.119 EDT LOCATION:  LWLockRelease, lwlock.c:587
TRAP: FailedAssertion("!(PrivateRefCount[i] == 0)", File: "bufmgr.c", Line: 1695)
3613 2011-06-16 20:05:18.119 EDT LOG:  00000: startup process (PID 3614) was terminated by signal 6: Aborted
3613 2011-06-16 20:05:18.119 EDT LOCATION:  LogChildExit, postmaster.c:2881
3613 2011-06-16 20:05:18.119 EDT LOG:  00000: aborting startup due to startup process failure
3613 2011-06-16 20:05:18.119 EDT LOCATION:  reaper, postmaster.c:2376

This happens due to heap_xlog_redo() calling UnlockReleaseBuffer() despite
having taken no buffer content lock.  I addedLockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
before the "if" to get past this.

I proceeded to do some immediate-shutdown tests to see if we get the bits set
as expected.  I set up a database like this:

    CREATE EXTENSION pageinspect;
    CREATE TABLE t (c int);
    INSERT INTO t VALUES (2);
    CHECKPOINT;

I normally cleared bits with "UPDATE t SET c = 1; CHECKPOINT;" and set them
with "VACUUM t".  I checked bits with this query:

    SELECT to_hex(get_byte(get_raw_page('t', 'vm', 0), 24)),
           to_hex(flags::int)
    FROM page_header(get_raw_page('t', 0));
 
The row from that query should generally be 0,1 (bits unset) or 1,5 (bits
set).  0,5 is fine after a crash.  1,1 means we've broken our contract: the VM
bit is set while PD_ALL_VISIBLE is not set.

First test was to clear bits, checkpoint, then VACUUM and SIGQUIT the
postmaster.  The system came back up with 1/1 bits.  I poked around enough to
see that XLByteLE(lsn, PageGetLSN(page)) was failing, but I did not dig any
deeper toward a root cause.  If you have trouble reproducing this, let me know
so I can assemble a complete, self-contained test case.


To crudely assess overhead, I created a table and set its hint bits like this:

    create table t as select * from generate_series(1,30000000);
    select count(*) from t;

autovacuum was off.  I built with -O2 and assert checking.  Then, I copied the
data directory and ran these commands multiple times with different builds:

    SELECT pg_current_xlog_insert_location();
    SET debug_assertions = off;
    VACUUM t;
    SELECT pg_current_xlog_insert_location();
 

Results:
    -Build-      -xlog before-   -xlog after-   -VACUUM time (ms)-
    patched      0/171E278       0/1D3FE68      50329.218
    patched      0/171E278       0/1D3FE68      48871.656
    patched      0/171E278       0/1D3FE68      46545.024
    unpatched    0/171E278       0/17202D0      49614.714
    unpatched    0/171E278       0/17202D0      45966.985
    unpatched    0/171E278       0/17202D0      44133.895

So, ~6 MiB of WAL to log the cleanness of ~1 GiB of table size.  Seems small
enough to not worry about optimizing further at this time.  I would like to
recheck it after we work out the functional issues, though.


No documentation updates present or needed.  The patch adds no tests, but we
don't have any recovery tests in `make check'.  That's bad, but it's not this
patch's job to fix it.  Patch is a unified diff, but I suppose it won't matter
since you will be committing it.

The code looks pretty clean; a few minor comments and questions:

> diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
> index 346d6b9..cf7d165 100644
> --- a/src/backend/access/heap/heapam.c
> +++ b/src/backend/access/heap/heapam.c

> @@ -2298,6 +2333,10 @@ l1:
>  
>      LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
>  
> +    /* We're done with the visibility map buffer */

I would delete this comment.  We were done earlier, but we needed to finish the
critical section.

> +    if (vmbuffer != InvalidBuffer)
> +        ReleaseBuffer(vmbuffer);
> +
>      /*
>       * If the tuple has toasted out-of-line attributes, we need to delete
>       * those items too.  We have to do this before releasing the buffer

> @@ -4331,6 +4435,78 @@ heap_xlog_freeze(XLogRecPtr lsn, XLogRecord *record)
>      UnlockReleaseBuffer(buffer);
>  }
>  
> +/*
> + * Replay XLOG_HEAP2_VISIBLE record.
> + *
> + * The critical integrity requirement here is that we must never end up with
> + * a situation where the visibility map bit is set, and the page-level
> + * PD_ALL_VISIBLE bit is clear.  If that were to occur, then a subsequent
> + * page modification would fail to clear the visibility map bit.
> + */
> +static void
> +heap_xlog_visible(XLogRecPtr lsn, XLogRecord *record)
> +{
> +    xl_heap_visible *xlrec = (xl_heap_visible *) XLogRecGetData(record);
> +    Buffer        buffer;
> +    Page        page;
> +
> +    /*
> +     * Read the heap page, if it still exists.  If the heap file has been
> +     * dropped or truncated later in recovery, this might fail.  In that case,
> +     * there's no point in doing anything further, since the visibility map
> +     * will have to be cleared out at the same time.
> +     */
> +    buffer = XLogReadBufferExtended(xlrec->node, MAIN_FORKNUM, xlrec->block,
> +                                    RBM_NORMAL);
> +    if (!BufferIsValid(buffer))
> +        return;
> +    page = (Page) BufferGetPage(buffer);
> +
> +    /*
> +     * We don't bump the LSN of the heap page when setting the visibility
> +     * map bit, because that would generate an unworkable volume of
> +     * full-page writes.  This exposes us to torn page hazards, but since
> +     * we're not inspecting the existing page contents in any way, we
> +     * don't care.
> +     *
> +     * However, all operations that clear the visibility map bit *do* bump
> +     * the LSN, and those operations will only be replayed if the XLOG LSN
> +     * precedes the page LSN.  Thus, if the page LSN has advanced past our
> +     * XLOG record's LSN, we mustn't mark the page all-visible, because
> +     * the subsequent update won't be replayed to clear the flag.
> +     */

Concerning the optimization in xlog_heap_delete() et al. of not changing the
page when its LSN is newer -- am I correct that it only ever triggers when
full_page_writes = off?  Assuming yes ...

> +    if (XLByteLE(lsn, PageGetLSN(page)))

(As mentioned above, this test failed improperly in one of my tests.)

> +    {
> +        PageSetAllVisible(page);
> +        MarkBufferDirty(buffer);
> +    }
> +
> +    /* Done with heap page. */
> +    UnlockReleaseBuffer(buffer);

(As mentioned above, no lock is held at this point.)

> +
> +    /*
> +     * Even we skipped the heap page update due to the LSN interlock, it's
> +     * still safe to update the visibility map.  Any WAL record that clears
> +     * the visibility map bit does so before checking the page LSN, so any
> +     * bits that need to be cleared will still be cleared.
> +     */
> +    if (record->xl_info & XLR_BKP_BLOCK_1)
> +        RestoreBkpBlocks(lsn, record, false);
> +    else
> +    {
> +        Relation    reln;
> +        Buffer        vmbuffer = InvalidBuffer;
> +
> +        reln = CreateFakeRelcacheEntry(xlrec->node);
> +        visibilitymap_pin(reln, xlrec->block, &vmbuffer);
> +        /* Don't set the bit if replay has already passed this point. */
> +        if (XLByteLE(lsn, PageGetLSN(BufferGetPage(vmbuffer))))
> +            visibilitymap_set(reln, xlrec->block, lsn, &vmbuffer);

... wouldn't it be better to do this unconditionally?  Some later record will
unset it if needed, because all replay functions that clear the bit do so
without consulting the vm page LSN.  On the other hand, the worst consequence
of the way you've done it is VACUUM visiting the page one more time to set the
bit.

> +        ReleaseBuffer(vmbuffer);
> +        FreeFakeRelcacheEntry(reln);
> +    }
> +}
> +
>  static void
>  heap_xlog_newpage(XLogRecPtr lsn, XLogRecord *record)
>  {

> --- a/src/backend/access/heap/hio.c
> +++ b/src/backend/access/heap/hio.c

> @@ -237,23 +239,37 @@ RelationGetBufferForTuple(Relation relation, Size len,
>           * Read and exclusive-lock the target block, as well as the other
>           * block if one was given, taking suitable care with lock ordering and
>           * the possibility they are the same block.
> +         *
> +         * If the page-level all-visible flag is set, caller will need to clear
> +         * both that and the corresponding visibility map bit.  However, by the
> +         * time we return, we'll have x-locked the buffer, and we don't want to
> +         * do any I/O while in that state.  So we check the bit here before
> +         * taking the lock, and pin the page if it appears necessary.
> +         * Checking without the lock creates a risk of getting the wrong
> +         * answer, so we'll have to recheck after acquiring the lock.
>           */

I think it's worth noting, perhaps based on your explanation in the
second-to-last paragraph of
http://archives.postgresql.org/message-id/BANLkTi=b7jVmq6fA_EXLCYgzuyV1u9at4A@mail.gmail.com
that the answer may be incorrect again after the recheck.  We don't care:
redundant clearings of the visibility bit are no problem.

> @@ -261,11 +277,36 @@ RelationGetBufferForTuple(Relation relation, Size len,
>          {
>              /* lock target buffer first */
>              buffer = ReadBuffer(relation, targetBlock);
> +            if (PageIsAllVisible(BufferGetPage(buffer)))
> +                visibilitymap_pin(relation, targetBlock, vmbuffer);
>              LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
>              LockBuffer(otherBuffer, BUFFER_LOCK_EXCLUSIVE);
>          }
>  
>          /*
> +         * If the page is all visible but we don't have the right visibility
> +         * map page pinned, then give up our locks, go get the pin, and
> +         * re-lock.  This is pretty painful, but hopefully shouldn't happen
> +         * often.  Note that there's a small possibility that we didn't pin
> +         * the page above but still have the correct page pinned anyway, either
> +         * because we've already made a previous pass through this loop, or
> +         * because caller passed us the right page anyway.
> +         */
> +        if (PageIsAllVisible(BufferGetPage(buffer))
> +            && !visibilitymap_pin_ok(targetBlock, *vmbuffer))
> +        {
> +            LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
> +            if (otherBlock != targetBlock)
> +                LockBuffer(otherBuffer, BUFFER_LOCK_UNLOCK);
> +            visibilitymap_pin(relation, targetBlock, vmbuffer);

Suppose you pin one VM page for a certain candidate heap page, then loop
around to another candidate that has a different VM page.  I don't see
anywhere that you ReleaseBuffer() the first vm page; am I missing it?  I
didn't try to construct a test case, but a relation with a too-small free
space block below 512 MiB and a sufficiently-large block above 512 MiB should
tickle that case.  The function header comment should document that vmbuffer,
if not InvalidBuffer, must be pinned and will be unpinned if modified (or
whatever other semantics you choose).

> +            if (otherBuffer != InvalidBuffer && otherBlock < targetBlock)
> +                LockBuffer(otherBuffer, BUFFER_LOCK_EXCLUSIVE);
> +            LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
> +            if (otherBuffer != InvalidBuffer && otherBlock > targetBlock)
> +                LockBuffer(otherBuffer, BUFFER_LOCK_EXCLUSIVE);
> +        }
> +
> +        /*
>           * Now we can check to see if there's enough free space here. If so,
>           * we're done.
>           */
> diff --git a/src/backend/access/heap/visibilitymap.c b/src/backend/access/heap/visibilitymap.c
> index 58bab7d..3a33463 100644
> --- a/src/backend/access/heap/visibilitymap.c
> +++ b/src/backend/access/heap/visibilitymap.c
> @@ -11,10 +11,11 @@
>   *      src/backend/access/heap/visibilitymap.c
>   *
>   * INTERFACE ROUTINES
> - *        visibilitymap_clear - clear a bit in the visibility map
> - *        visibilitymap_pin    - pin a map page for setting a bit
> - *        visibilitymap_set    - set a bit in a previously pinned page
> - *        visibilitymap_test    - test if a bit is set
> + *        visibilitymap_clear  - clear a bit in the visibility map
> + *        visibilitymap_pin     - pin a map page for setting a bit
> + *      visibilitymap_pin_ok - check whether correct map page is already pinned
> + *        visibilitymap_set     - set a bit in a previously pinned page
> + *        visibilitymap_test     - test if a bit is set

Several lines above are inconsistent on internal tabs/spaces.

>   *
>   * NOTES
>   *
> @@ -76,19 +77,11 @@
>   * index would still be visible to all other backends, and deletions wouldn't
>   * be visible to other backends yet.  (But HOT breaks that argument, no?)

The two paragraphs ending above are ready to be removed.

>   *
> - * There's another hole in the way the PD_ALL_VISIBLE flag is set. When
> - * vacuum observes that all tuples are visible to all, it sets the flag on
> - * the heap page, and also sets the bit in the visibility map. If we then
> - * crash, and only the visibility map page was flushed to disk, we'll have
> - * a bit set in the visibility map, but the corresponding flag on the heap
> - * page is not set. If the heap page is then updated, the updater won't
> - * know to clear the bit in the visibility map.  (Isn't that prevented by
> - * the LSN interlock?)
> - *
>   *-------------------------------------------------------------------------
>   */
>  #include "postgres.h"
>  
> +#include "access/heapam.h"
>  #include "access/visibilitymap.h"
>  #include "storage/bufmgr.h"
>  #include "storage/bufpage.h"
> @@ -127,38 +120,37 @@ static void vm_extend(Relation rel, BlockNumber nvmblocks);
>  /*
>   *    visibilitymap_clear - clear a bit in visibility map
>   *
> - * Clear a bit in the visibility map, marking that not all tuples are
> - * visible to all transactions anymore.
> + * You must pass a buffer containing the correct map page to this function.
> + * Call visibilitymap_pin first to pin the right one. This function doesn't do
> + * any I/O.
>   */
>  void
> -visibilitymap_clear(Relation rel, BlockNumber heapBlk)
> +visibilitymap_clear(Relation rel, BlockNumber heapBlk, Buffer *buf)
>  {
>      BlockNumber mapBlock = HEAPBLK_TO_MAPBLOCK(heapBlk);
>      int            mapByte = HEAPBLK_TO_MAPBYTE(heapBlk);
>      int            mapBit = HEAPBLK_TO_MAPBIT(heapBlk);
>      uint8        mask = 1 << mapBit;
> -    Buffer        mapBuffer;
>      char       *map;
>  
>  #ifdef TRACE_VISIBILITYMAP
>      elog(DEBUG1, "vm_clear %s %d", RelationGetRelationName(rel), heapBlk);
>  #endif
>  
> -    mapBuffer = vm_readbuf(rel, mapBlock, false);
> -    if (!BufferIsValid(mapBuffer))
> -        return;                    /* nothing to do */
> +    if (!BufferIsValid(*buf) || BufferGetBlockNumber(*buf) != mapBlock)
> +        elog(ERROR, "wrong buffer passed to visibilitymap_clear");
>  

This could just as well be an Assert.

> -    LockBuffer(mapBuffer, BUFFER_LOCK_EXCLUSIVE);
> -    map = PageGetContents(BufferGetPage(mapBuffer));
> +    LockBuffer(*buf, BUFFER_LOCK_EXCLUSIVE);
> +    map = PageGetContents(BufferGetPage(*buf));
>  
>      if (map[mapByte] & mask)
>      {
>          map[mapByte] &= ~mask;
>  
> -        MarkBufferDirty(mapBuffer);
> +        MarkBufferDirty(*buf);
>      }
>  
> -    UnlockReleaseBuffer(mapBuffer);
> +    LockBuffer(*buf, BUFFER_LOCK_UNLOCK);
>  }
>  
>  /*
> @@ -194,15 +186,32 @@ visibilitymap_pin(Relation rel, BlockNumber heapBlk, Buffer *buf)
>  }
>  
>  /*
> + *    visibilitymap_pin_ok - do we already have the correct page pinned?
> + *
> + * On entry, buf should be InvalidBuffer or a valid buffer returned by
> + * an earlier call to visibilitymap_pin or visibilitymap_test on the same
> + * relation.  The return value indicates whether the buffer covers the
> + * given heapBlk.
> + */
> +bool
> +visibilitymap_pin_ok(BlockNumber heapBlk, Buffer buf)
> +{
> +    BlockNumber mapBlock = HEAPBLK_TO_MAPBLOCK(heapBlk);
> +
> +    return BufferIsValid(buf) && BufferGetBlockNumber(buf) == mapBlock;
> +}

This fills RelationGetBufferForTuple()'s needs, but the name doesn't seem to
fit the task too well.  This function doesn't check the pin or that the buffer
applies to the correct relation.  Consider naming it something like
`visibilitymap_buffer_covers_block'.

> +
> +/*
>   *    visibilitymap_set - set a bit on a previously pinned page
>   *
> - * recptr is the LSN of the heap page. The LSN of the visibility map page is
> - * advanced to that, to make sure that the visibility map doesn't get flushed
> - * to disk before the update to the heap page that made all tuples visible.
> + * recptr is the LSN of the XLOG record we're replaying, if we're in recovery,
> + * or InvalidXLogRecPtr in normal running.  The page LSN is advanced to the
> + * one provided; in normal running, we generate a new XLOG record and set the
> + * page LSN to that value.

I like the sentiment of sharing more code between the regular and redo cases,
but I don't care for this approach.  This doesn't look like an invocation that
skips WAL for the operation, but it is:

	visibilitymap_set(onerel, blkno, PageGetLSN(page), &vmbuffer);

In fact, it's a near-certainly wrong way to call this function, because
PageGetLSN(page) may or may not be valid.  On the other hand, with just three
calls sites and those not likely to grow much, we can just get it right.
Still, consider adding:

	Assert(InRecovery || XLogRecPtrIsInvalid(recptr));

>   *
> - * This is an opportunistic function. It does nothing, unless *buf
> - * contains the bit for heapBlk. Call visibilitymap_pin first to pin
> - * the right map page. This function doesn't do any I/O.
> + * You must pass a buffer containing the correct map page to this function.
> + * Call visibilitymap_pin first to pin the right one. This function doesn't do
> + * any I/O.
>   */
>  void
>  visibilitymap_set(Relation rel, BlockNumber heapBlk, XLogRecPtr recptr,
> @@ -220,7 +229,7 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, XLogRecPtr recptr,
>  
>      /* Check that we have the right page pinned */
>      if (!BufferIsValid(*buf) || BufferGetBlockNumber(*buf) != mapBlock)
> -        return;
> +        elog(ERROR, "wrong buffer passed to visibilitymap_set");

Likewise, this could just as well be an Assert.

>  
>      page = BufferGetPage(*buf);
>      map = PageGetContents(page);
> @@ -230,7 +239,10 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, XLogRecPtr recptr,
>      {
>          map[mapByte] |= (1 << mapBit);
>  
> -        if (XLByteLT(PageGetLSN(page), recptr))
> +        if (XLogRecPtrIsInvalid(recptr) && RelationNeedsWAL(rel))
> +            recptr = log_heap_visible(rel->rd_node, heapBlk, *buf);
> +
> +        if (!XLogRecPtrIsInvalid(recptr))
>              PageSetLSN(page, recptr);
>          PageSetTLI(page, ThisTimeLineID);

Might this be clearer as follows?

	if (RelationNeedsWAL(rel))
	{
		if (XLogRecPtrIsInvalid(recptr))
			recptr = log_heap_visible(rel->rd_node, heapBlk, *buf);
		PageSetLSN(page, recptr);
		PageSetTLI(page, ThisTimeLineID);
	}

> --- a/src/backend/commands/vacuumlazy.c
> +++ b/src/backend/commands/vacuumlazy.c
> @@ -479,7 +479,8 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
>                  visibilitymap_pin(onerel, blkno, &vmbuffer);
>                  LockBuffer(buf, BUFFER_LOCK_SHARE);
>                  if (PageIsAllVisible(page))
> -                    visibilitymap_set(onerel, blkno, PageGetLSN(page), &vmbuffer);
> +                    visibilitymap_set(onerel, blkno, InvalidXLogRecPtr,
> +                                      &vmbuffer);

As mentioned above, there's another call to visibilitymap_set() later in this
function that needs the same change.

Thanks,
nm


Re: crash-safe visibility map, take five

From
Robert Haas
Date:
On Thu, Jun 16, 2011 at 11:17 PM, Noah Misch <noah@leadboat.com> wrote:
> I took a look at this patch.

No kidding!  Thanks for the very detailed review.

> +1 for Buffer over Buffer * when the value isn't mutated for the caller.

I changed this.

> I suggest revisiting the suggestion in
> http://archives.postgresql.org/message-id/27743.1291135210@sss.pgh.pa.us and
> including a latestRemovedXid in xl_heap_visible.  The range of risky xids is
> the same for setting a visibility map bit as for deleting a dead tuple, and
> the same operation (VACUUM) produces both conflicts.

See Heikki's follow-up email.  The standby has to ignore all-visible
bits anyway, because the fact that a transaction is all-visible on the
master does not imply that it is all-visible on the standby.  So I
don't think there's a problem here.

> lazy_scan_heap() has two calls to visibilitymap_set(), but the patch only
> changed the recptr argument for one of them.  This has the effect that we only
> emit WAL for empty pages and pages that happened to have pd_lsn == {0,0}, such
> as those not modified since the transaction creating the table.  I fixed this
> before testing further.

Good catch, thanks.  I also added the Assert() that you recommended
further down.

> This happens due to heap_xlog_redo() calling UnlockReleaseBuffer() despite
> having taken no buffer content lock.  I added
>        LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
> before the "if" to get past this.

Fixed, thanks.

> I proceeded to do some immediate-shutdown tests to see if we get the bits set
> as expected.  I set up a database like this:
>        CREATE EXTENSION pageinspect;
>        CREATE TABLE t (c int);
>        INSERT INTO t VALUES (2);
>        CHECKPOINT;
> I normally cleared bits with "UPDATE t SET c = 1; CHECKPOINT;" and set them
> with "VACUUM t".  I checked bits with this query:
>        SELECT to_hex(get_byte(get_raw_page('t', 'vm', 0), 24)),
>               to_hex(flags::int) FROM page_header(get_raw_page('t', 0));
> The row from that query should generally be 0,1 (bits unset) or 1,5 (bits
> set).  0,5 is fine after a crash.  1,1 means we've broken our contract: the VM
> bit is set while PD_ALL_VISIBLE is not set.
>
> First test was to clear bits, checkpoint, then VACUUM and SIGQUIT the
> postmaster.  The system came back up with 1/1 bits.  I poked around enough to
> see that XLByteLE(lsn, PageGetLSN(page)) was failing, but I did not dig any
> deeper toward a root cause.  If you have trouble reproducing this, let me know
> so I can assemble a complete, self-contained test case.

Thank you for putting these test cases together.  As you can probably
tell, I was having difficulty figuring out exactly how to test this.

I think that the problem here is that the sense of that test is
exactly backwards from what it should be.  IIUC, the word "precedes"
in the previous comment should in fact say "follows".

> I would delete this comment.  We were done earlier, but we needed to finish the
> critical section.

Done.

> Concerning the optimization in xlog_heap_delete() et al. of not changing the
> page when its LSN is newer -- am I correct that it only ever triggers when
> full_page_writes = off?  Assuming yes ...

I believe that's right.

>> +     /*
>> +      * Even we skipped the heap page update due to the LSN interlock, it's
>> +      * still safe to update the visibility map.  Any WAL record that clears
>> +      * the visibility map bit does so before checking the page LSN, so any
>> +      * bits that need to be cleared will still be cleared.
>> +      */
>> +     if (record->xl_info & XLR_BKP_BLOCK_1)
>> +             RestoreBkpBlocks(lsn, record, false);
>> +     else
>> +     {
>> +             Relation        reln;
>> +             Buffer          vmbuffer = InvalidBuffer;
>> +
>> +             reln = CreateFakeRelcacheEntry(xlrec->node);
>> +             visibilitymap_pin(reln, xlrec->block, &vmbuffer);
>> +             /* Don't set the bit if replay has already passed this point. */
>> +             if (XLByteLE(lsn, PageGetLSN(BufferGetPage(vmbuffer))))
>> +                     visibilitymap_set(reln, xlrec->block, lsn, &vmbuffer);
>
> ... wouldn't it be better to do this unconditionally?  Some later record will
> unset it if needed, because all replay functions that clear the bit do so
> without consulting the vm page LSN.  On the other hand, the worst consequence
> of the way you've done it is VACUUM visiting the page one more time to set the
> bit.

Hmm, now that I look at it, I think this test is backwards too.  I
think you might be right that it wouldn't hurt anything to go ahead
and set it, but I'm inclined to leave it in for now.

> I think it's worth noting, perhaps based on your explanation in the
> second-to-last paragraph of
> http://archives.postgresql.org/message-id/BANLkTi=b7jVmq6fA_EXLCYgzuyV1u9at4A@mail.gmail.com
> that the answer may be incorrect again after the recheck.  We don't care:
> redundant clearings of the visibility bit are no problem.

I added a comment.  See what you think.

> Suppose you pin one VM page for a certain candidate heap page, then loop
> around to another candidate that has a different VM page.  I don't see
> anywhere that you ReleaseBuffer() the first vm page; am I missing it?  I
> didn't try to construct a test case, but a relation with a too-small free
> space block below 512 MiB and a sufficiently-large block above 512 MiB should
> tickle that case.  The function header comment should document that vmbuffer,
> if not InvalidBuffer, must be pinned and will be unpinned if modified (or
> whatever other semantics you choose).

visibilitymap_pin() releases the old pin if it decides to switch pages.

> Several lines above are inconsistent on internal tabs/spaces.

Fixed... I hope.

> The two paragraphs ending above are ready to be removed.

Done.

> This could just as well be an Assert.

Maybe, but I'm going to leave it the way it is for now.  It's not the
only place in the code where we do stuff like that.  The cost of
testing it in a non-assert-enabled build is quite small.

> This fills RelationGetBufferForTuple()'s needs, but the name doesn't seem to
> fit the task too well.  This function doesn't check the pin or that the buffer
> applies to the correct relation.  Consider naming it something like
> `visibilitymap_buffer_covers_block'.

I think that this may be related to the fact that visibilitymap_pin()
doesn't do what you might expect.  But that name is pre-existing, so I
kept it, and I don't really want this to be something that sounds
totally unrelated.

> Might this be clearer as follows?

Yep, done.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachment

Re: crash-safe visibility map, take five

From
Noah Misch
Date:
On Fri, Jun 17, 2011 at 01:21:17PM -0400, Robert Haas wrote:
> On Thu, Jun 16, 2011 at 11:17 PM, Noah Misch <noah@leadboat.com> wrote:
> > I suggest revisiting the suggestion in
> > http://archives.postgresql.org/message-id/27743.1291135210@sss.pgh.pa.us and
> > including a latestRemovedXid in xl_heap_visible.  The range of risky xids is
> > the same for setting a visibility map bit as for deleting a dead tuple, and
> > the same operation (VACUUM) produces both conflicts.
>
> See Heikki's follow-up email.  The standby has to ignore all-visible
> bits anyway, because the fact that a transaction is all-visible on the
> master does not imply that it is all-visible on the standby.  So I
> don't think there's a problem here.

Indeed.  If we conflicted on the xmin of the most-recent all-visible tuple
when setting the bit, all-visible on the standby would become a superset of
all-visible on the master, and we could cease to ignore the bits.  This is an
excellent trade-off for masters that do UPDATE and DELETE, because they're
already conflicting (or using avoidance measures) on similar xids at VACUUM
time.  (Some INSERT-only masters will disfavor the trade-off, but we could
placate them with a GUC.  On the other hand, hot_standby_feedback works and
costs an INSERT-only master so little.)

> > I proceeded to do some immediate-shutdown tests to see if we get the bits set
> > as expected.  I set up a database like this:
> >        CREATE EXTENSION pageinspect;
> >        CREATE TABLE t (c int);
> >        INSERT INTO t VALUES (2);
> >        CHECKPOINT;
> > I normally cleared bits with "UPDATE t SET c = 1; CHECKPOINT;" and set them
> > with "VACUUM t".  I checked bits with this query:
> >        SELECT to_hex(get_byte(get_raw_page('t', 'vm', 0), 24)),
> >               to_hex(flags::int) FROM page_header(get_raw_page('t', 0));
> > The row from that query should generally be 0,1 (bits unset) or 1,5 (bits
> > set).  0,5 is fine after a crash.  1,1 means we've broken our contract: the VM
> > bit is set while PD_ALL_VISIBLE is not set.
> >
> > First test was to clear bits, checkpoint, then VACUUM and SIGQUIT the
> > postmaster.  The system came back up with 1/1 bits.  I poked around enough to
> > see that XLByteLE(lsn, PageGetLSN(page)) was failing, but I did not dig any
> > deeper toward a root cause.  If you have trouble reproducing this, let me know
> > so I can assemble a complete, self-contained test case.
>
> Thank you for putting these test cases together.  As you can probably
> tell, I was having difficulty figuring out exactly how to test this.

No problem.  I ran these test cases with the new patch:

-Description-                       -Expected bits-   -Result-
set bit, VACUUM commit, crash       1,5               1,5
set bit, crash                      0,1               0,1
set bit, flush heap page, crash     0,5               0,5
set bit, flush vm page, crash       1,5               1,5

xlog flush request locations look correct, too.  Overall, looking good.  Do
you know of other notable cases to check?  The last two were somewhat tricky
to inject; if anyone wishes to reproduce them, I'm happy to share my recipe.

One more thing came to mind: don't we need a critical section inside the "if"
block in visibilitymap_set()?  I can't see anything in there that could
elog(ERROR, ...), but my general impression is that we use one anyway.

I ran one repetition of my benchmark from last report, and the amount of WAL
has not changed.  Given that, I think the previous runs remain valid.

> I think that the problem here is that the sense of that test is
> exactly backwards from what it should be.  IIUC, the word "precedes"
> in the previous comment should in fact say "follows".

Ah; quite so.

> >> +             /* Don't set the bit if replay has already passed this point. */
> >> +             if (XLByteLE(lsn, PageGetLSN(BufferGetPage(vmbuffer))))
> >> +                     visibilitymap_set(reln, xlrec->block, lsn, &vmbuffer);
> >
> > ... wouldn't it be better to do this unconditionally?  Some later record will
> > unset it if needed, because all replay functions that clear the bit do so
> > without consulting the vm page LSN.  On the other hand, the worst consequence
> > of the way you've done it is VACUUM visiting the page one more time to set the
> > bit.
>
> Hmm, now that I look at it, I think this test is backwards too.  I
> think you might be right that it wouldn't hurt anything to go ahead
> and set it, but I'm inclined to leave it in for now.

Okay.  Consider expanding the comment to note that this is out of conservatism
rather than known necessity.

> > I think it's worth noting, perhaps based on your explanation in the
> > second-to-last paragraph of
> > http://archives.postgresql.org/message-id/BANLkTi=b7jVmq6fA_EXLCYgzuyV1u9at4A@mail.gmail.com
> > that the answer may be incorrect again after the recheck.  We don't care:
> > redundant clearings of the visibility bit are no problem.
>
> I added a comment.  See what you think.

That should help.

> > Several lines above are inconsistent on internal tabs/spaces.
>
> Fixed... I hope.

Probably not worth mentioning, but there remains one space instead of one tab
after "visibilitymap_clear" in this line:
 *        visibilitymap_clear     - clear a bit in the visibility map

FWIW, I just do C-s TAB in emacs and then scan for places where the boundaries
of the highlighted regions fail to line up vertically.

> > This could just as well be an Assert.
>
> Maybe, but I'm going to leave it the way it is for now.  It's not the
> only place in the code where we do stuff like that.  The cost of
> testing it in a non-assert-enabled build is quite small.

Okay.

> > This fills RelationGetBufferForTuple()'s needs, but the name doesn't seem to
> > fit the task too well.  This function doesn't check the pin or that the buffer
> > applies to the correct relation.  Consider naming it something like
> > `visibilitymap_buffer_covers_block'.
>
> I think that this may be related to the fact that visibilitymap_pin()
> doesn't do what you might expect.  But that name is pre-existing, so I
> kept it, and I don't really want this to be something that sounds
> totally unrelated.

Ah, so visibilitymap_pin_ok() answers the question "Will visibilitymap_pin()
be a long-winded no-op given these arguments?"


Re: crash-safe visibility map, take five

From
Robert Haas
Date:
On Sat, Jun 18, 2011 at 9:37 AM, Noah Misch <noah@leadboat.com> wrote:
> Indeed.  If we conflicted on the xmin of the most-recent all-visible tuple
> when setting the bit, all-visible on the standby would become a superset of
> all-visible on the master, and we could cease to ignore the bits.  This is an
> excellent trade-off for masters that do UPDATE and DELETE, because they're
> already conflicting (or using avoidance measures) on similar xids at VACUUM
> time.  (Some INSERT-only masters will disfavor the trade-off, but we could
> placate them with a GUC.  On the other hand, hot_standby_feedback works and
> costs an INSERT-only master so little.)

I hadn't really considered the possibility that making more things
conflict on the standby server could work out to a win.  I think that
would need some careful testing, and I don't want to get tied up in
that for purposes of this patch.  There is no law against a WAL format
change, so we can always do it later if someone wants to do the
legwork.

> No problem.  I ran these test cases with the new patch:
>
> -Description-                       -Expected bits-   -Result-
> set bit, VACUUM commit, crash       1,5               1,5
> set bit, crash                      0,1               0,1
> set bit, flush heap page, crash     0,5               0,5
> set bit, flush vm page, crash       1,5               1,5
>
> xlog flush request locations look correct, too.  Overall, looking good.  Do
> you know of other notable cases to check?  The last two were somewhat tricky
> to inject; if anyone wishes to reproduce them, I'm happy to share my recipe.

Those sound like interesting recipes...

> One more thing came to mind: don't we need a critical section inside the "if"
> block in visibilitymap_set()?  I can't see anything in there that could
> elog(ERROR, ...), but my general impression is that we use one anyway.

Seems like a good idea.

> I ran one repetition of my benchmark from last report, and the amount of WAL
> has not changed.  Given that, I think the previous runs remain valid.

As far as I can see, the upshot of those benchmarks was that more WAL
was generated, but the time to execute vacuum didn't really change.  I
think that's going to be pretty typical.  In your test, the additional
WAL amounted to about 0.6% of the amount of data VACUUM is dirtying,
which I think is pretty much in the noise, assuming wal_buffers and
checkpoint_segments are tuned to suitable values.

The other thing to worry about from a performance view is whether the
push-ups required to relocate the clear-the-vm-bits logic to within
the critical sections is going to have a significant impact on
ordinary insert/update/delete operations.  I was a bit unsure at first
about Heikki's contention that it wouldn't matter, but after going
through the exercise of writing the code I think I see now why he
wasn't concerned.  The only possibly-noticeable overhead comes from
the possibility that we might decide not to pin the visibility map
buffer before locking the page(s) we're operating on, and need to
unlock-pin-and-relock.  But for that to happen, VACUUM's got to buzz
through and mark the page all-visible just around the time that we
examine bit.  And I have a tough time imagining that happening very
often.  You have to have a page that has been modified since the last
VACUUM, but not too recently (otherwise it's not actually all-visible)
and then VACUUM has to get there and process it at almost the same
moment that someone decides to make another modification.  It's hard
even to think about an artificial test case that would hit that
situation with any regularity.

>> Hmm, now that I look at it, I think this test is backwards too.  I
>> think you might be right that it wouldn't hurt anything to go ahead
>> and set it, but I'm inclined to leave it in for now.
>
> Okay.  Consider expanding the comment to note that this is out of conservatism
> rather than known necessity.

Done.

> Probably not worth mentioning, but there remains one space instead of one tab
> after "visibilitymap_clear" in this line:
>  *              visibilitymap_clear      - clear a bit in the visibility map

Gee, I'm not sure whether there's a one true way of doing that.  I
know we have a no-spaces-before-tabs rule but I'm not sure what the
guidelines are for indenting elsewhere in the line.

> FWIW, I just do C-s TAB in emacs and then scan for places where the boundaries
> of the highlighted regions fail to line up vertically.

vim.

>> > This fills RelationGetBufferForTuple()'s needs, but the name doesn't seem to
>> > fit the task too well.  This function doesn't check the pin or that the buffer
>> > applies to the correct relation.  Consider naming it something like
>> > `visibilitymap_buffer_covers_block'.
>>
>> I think that this may be related to the fact that visibilitymap_pin()
>> doesn't do what you might expect.  But that name is pre-existing, so I
>> kept it, and I don't really want this to be something that sounds
>> totally unrelated.
>
> Ah, so visibilitymap_pin_ok() answers the question "Will visibilitymap_pin()
> be a long-winded no-op given these arguments?"

Bingo.  In HEAD, there's no real reason to care; you may as well just
try it and see what happens.  But the buffer lock dance makes it
necessary.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachment

Re: crash-safe visibility map, take five

From
Noah Misch
Date:
On Sat, Jun 18, 2011 at 10:04:52PM -0400, Robert Haas wrote:
> On Sat, Jun 18, 2011 at 9:37 AM, Noah Misch <noah@leadboat.com> wrote:
> > Indeed.  If we conflicted on the xmin of the most-recent all-visible tuple
> > when setting the bit, all-visible on the standby would become a superset of
> > all-visible on the master, and we could cease to ignore the bits.  This is an
> > excellent trade-off for masters that do UPDATE and DELETE, because they're
> > already conflicting (or using avoidance measures) on similar xids at VACUUM
> > time.  (Some INSERT-only masters will disfavor the trade-off, but we could
> > placate them with a GUC.  On the other hand, hot_standby_feedback works and
> > costs an INSERT-only master so little.)
> 
> I hadn't really considered the possibility that making more things
> conflict on the standby server could work out to a win.  I think that
> would need some careful testing, and I don't want to get tied up in
> that for purposes of this patch.  There is no law against a WAL format
> change, so we can always do it later if someone wants to do the
> legwork.

Fair enough.

> > I ran one repetition of my benchmark from last report, and the amount of WAL
> > has not changed.  Given that, I think the previous runs remain valid.
> 
> As far as I can see, the upshot of those benchmarks was that more WAL
> was generated, but the time to execute vacuum didn't really change.  I
> think that's going to be pretty typical.  In your test, the additional
> WAL amounted to about 0.6% of the amount of data VACUUM is dirtying,
> which I think is pretty much in the noise, assuming wal_buffers and
> checkpoint_segments are tuned to suitable values.

I think that's about right.  The timings were too unstable to conclude much,
so I've focused on the WAL volume, which isn't worrisome.

> The other thing to worry about from a performance view is whether the
> push-ups required to relocate the clear-the-vm-bits logic to within
> the critical sections is going to have a significant impact on
> ordinary insert/update/delete operations.  I was a bit unsure at first
> about Heikki's contention that it wouldn't matter, but after going
> through the exercise of writing the code I think I see now why he
> wasn't concerned.

I agree; that part of the cost looks unmeasurable.

> The only possibly-noticeable overhead comes from
> the possibility that we might decide not to pin the visibility map
> buffer before locking the page(s) we're operating on, and need to
> unlock-pin-and-relock.  But for that to happen, VACUUM's got to buzz
> through and mark the page all-visible just around the time that we
> examine the bit.  And I have a tough time imagining that happening very
> often.  You have to have a page that has been modified since the last
> VACUUM, but not too recently (otherwise it's not actually all-visible)
> and then VACUUM has to get there and process it at almost the same
> moment that someone decides to make another modification.  It's hard
> even to think about an artificial test case that would hit that
> situation with any regularity.

I can't see it happening very often, either.
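For anyone following along, the dance Robert describes is easy to sketch. The toy Python below is not the real buffer-manager code: `VisibilityMap`, `HeapPage`, and `lock_with_vm_pin` are invented stand-ins for `visibilitymap_pin()` and the heap-page content lock, just to show why the pin happens unlocked and where the rare retry lives.

```python
class VisibilityMap:
    """Toy stand-in for the VM fork; pin() represents a buffer read
    that may do I/O, which is why it must happen before locking."""
    def __init__(self):
        self.pins = 0

    def pin(self, block):
        self.pins += 1
        return ("vm-pin", block)


class HeapPage:
    def __init__(self, block, all_visible=False):
        self.block = block
        self.all_visible = all_visible   # toy PD_ALL_VISIBLE flag
        self.locked = False

    def lock(self):
        self.locked = True

    def unlock(self):
        self.locked = False


def lock_with_vm_pin(page, vm):
    """Pin the VM buffer (possible I/O) before taking the content lock.
    If VACUUM sets the bit between our unlocked check and the lock,
    we take the rare unlock-pin-and-relock path and retry."""
    pin = None
    while True:
        if page.all_visible and pin is None:
            pin = vm.pin(page.block)   # done while unlocked
        page.lock()
        if page.all_visible and pin is None:
            page.unlock()              # lost the race with VACUUM; retry
            continue
        return pin
```

The retry branch fires only when VACUUM marks the page all-visible in the window between the unlocked check and the lock, which is the case argued above to be vanishingly rare.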


This version of the patch has some bogus hunks in int.c, int8.c, and heap.c
reverting a few of your other recent commits.  Looks good otherwise.  I've
marked the patch "Ready for Committer".

Thanks,
nm


Re: crash-safe visibility map, take five

From
Robert Haas
Date:
On Sat, Jun 18, 2011 at 10:41 PM, Noah Misch <noah@leadboat.com> wrote:
> This version of the patch has some bogus hunks in int.c, int8.c, and heap.c
> reverting a few of your other recent commits.

Woops, good catch.  Corrected version attached, in case anyone else
wants to take a look at it.

> Looks good otherwise.  I've
> marked the patch "Ready for Committer".

Thanks.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachment

Re: crash-safe visibility map, take five

From
Jeff Davis
Date:
On Thu, 2011-06-16 at 23:17 -0400, Noah Misch wrote:
> 2. In the words of a comment added by the patch:
>  * The critical integrity requirement here is that we must never end up with
>  * a situation where the visibility map bit is set, and the page-level
>  * PD_ALL_VISIBLE bit is clear.  If that were to occur, then a subsequent
>  * page modification would fail to clear the visibility map bit.
> It does this by WAL-logging the operation of setting a vm bit.  This also has
> the benefit of getting vm bits set correctly on standbys.

In the same function, there is also the comment:

"We don't bump the LSN of the heap page when setting the visibility
map bit, because that would generate an unworkable volume of
full-page writes.  This exposes us to torn page hazards, but since
we're not inspecting the existing page contents in any way, we
don't care."

It would be nice to have a comment explaining why that is safe with
respect to the WAL-before-data rule. Obviously torn pages aren't much of
a problem, because it's a single bit and completely idempotent.
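For what it's worth, the idempotency half of that argument is easy to see in miniature. `redo_set_vm_bit` below is an invented stand-in for the real redo routine, which just ORs the bit in:

```python
def redo_set_vm_bit(vm_byte, bit):
    """Invented redo routine: OR the bit in, touch nothing else."""
    return vm_byte | (1 << bit)


# Whether or not the bit reached disk before the crash (a "torn" write),
# replaying the record converges to the same byte.
torn = redo_set_vm_bit(0b0000, 2)   # bit made it to disk pre-crash
clean = 0b0000                      # bit did not make it
assert redo_set_vm_bit(torn, 2) == redo_set_vm_bit(clean, 2) == 0b0100
```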

Regards,
Jeff Davis




Re: crash-safe visibility map, take five

From
Robert Haas
Date:
On Wed, Jun 22, 2011 at 8:53 PM, Jeff Davis <pgsql@j-davis.com> wrote:
> On Thu, 2011-06-16 at 23:17 -0400, Noah Misch wrote:
>> 2. In the words of a comment added by the patch:
>>  * The critical integrity requirement here is that we must never end up with
>>  * a situation where the visibility map bit is set, and the page-level
>>  * PD_ALL_VISIBLE bit is clear.  If that were to occur, then a subsequent
>>  * page modification would fail to clear the visibility map bit.
>> It does this by WAL-logging the operation of setting a vm bit.  This also has
>> the benefit of getting vm bits set correctly on standbys.
>
> In the same function, there is also the comment:
>
> "We don't bump the LSN of the heap page when setting the visibility
> map bit, because that would generate an unworkable volume of
> full-page writes.  This exposes us to torn page hazards, but since
> we're not inspecting the existing page contents in any way, we
> don't care."
>
> It would be nice to have a comment explaining why that is safe with
> respect to the WAL-before-data rule. Obviously torn pages aren't much of
> a problem, because it's a single bit and completely idempotent.

That's exactly what I was trying to explain in the comment you cite.
Feel free to propose a specific change...

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: crash-safe visibility map, take five

From
Jeff Davis
Date:
On Wed, 2011-06-22 at 21:53 -0400, Robert Haas wrote:
> On Wed, Jun 22, 2011 at 8:53 PM, Jeff Davis <pgsql@j-davis.com> wrote:
> > On Thu, 2011-06-16 at 23:17 -0400, Noah Misch wrote:
> >> 2. In the words of a comment added by the patch:
> >>  * The critical integrity requirement here is that we must never end up with
> >>  * a situation where the visibility map bit is set, and the page-level
> >>  * PD_ALL_VISIBLE bit is clear.  If that were to occur, then a subsequent
> >>  * page modification would fail to clear the visibility map bit.
> >> It does this by WAL-logging the operation of setting a vm bit.  This also has
> >> the benefit of getting vm bits set correctly on standbys.
> >
> > In the same function, there is also the comment:
> >
> > "We don't bump the LSN of the heap page when setting the visibility
> > map bit, because that would generate an unworkable volume of
> > full-page writes.  This exposes us to torn page hazards, but since
> > we're not inspecting the existing page contents in any way, we
> > don't care."
> >
> > It would be nice to have a comment explaining why that is safe with
> > respect to the WAL-before-data rule. Obviously torn pages aren't much of
> > a problem, because it's a single bit and completely idempotent.
> 
> That's exactly what I was trying to explain in the comment you cite.
> Feel free to propose a specific change...

Well, I was a little unsure, but here's my attempt:

The potential problems are:
1. Torn pages -- not a problem because it's a single bit and idempotent.
2. PD_ALL_VISIBLE bit makes it to disk before a WAL record representing
an action that makes the page all-visible. Depending on what action
makes a page all-visible:
 a. An old snapshot is released -- not a problem, because if there is a
crash all snapshots are released.
 b. Cleanup action on the page -- not a problem, because that will
create a WAL record and update the page's LSN before setting the
PD_ALL_VISIBLE.

First of all, is that correct? Second, are there other cases to
consider?
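To make case (b) concrete, here is a toy model of the WAL-before-data check. `WAL` and `can_write_out` are invented names, but the rule they sketch (a page may reach disk only once WAL up to its LSN is durable) is the one the buffer manager enforces:

```python
class WAL:
    """Toy WAL: hands out LSNs and tracks the durable position."""
    def __init__(self):
        self.next_lsn = 1
        self.flushed = 0

    def insert(self, record):
        lsn = self.next_lsn
        self.next_lsn += 1
        return lsn

    def flush(self, upto):
        self.flushed = max(self.flushed, upto)


def can_write_out(page_lsn, wal):
    # WAL-before-data: a page may reach disk only once WAL up to its
    # LSN is durable.
    return page_lsn <= wal.flushed


wal = WAL()
page_lsn = wal.insert("prune")   # case (b): cleanup bumps the page LSN
pd_all_visible = True            # then the flag is set, LSN unchanged
# The flag cannot reach disk ahead of the cleanup record:
assert not can_write_out(page_lsn, wal)
wal.flush(page_lsn)
assert can_write_out(page_lsn, wal)
```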

Regards,
Jeff Davis



Re: crash-safe visibility map, take five

From
Robert Haas
Date:
On Wed, Jun 22, 2011 at 10:40 PM, Jeff Davis <pgsql@j-davis.com> wrote:
> 1. Torn pages -- not a problem because it's a single bit and idempotent.

Right.

> 2. PD_ALL_VISIBLE bit makes it to disk before a WAL record representing
> an action that makes the page all-visible. Depending on what action
> makes a page all-visible:
>  a. An old snapshot is released -- not a problem, because if there is a
> crash all snapshots are released.
>  b. Cleanup action on the page -- not a problem, because that will
> create a WAL record and update the page's LSN before setting the
> PD_ALL_VISIBLE.

Lazy VACUUM is the only thing that makes a page all visible.  I don't
understand the part about snapshots.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: crash-safe visibility map, take five

From
Jeff Davis
Date:
On Thu, 2011-06-23 at 18:18 -0400, Robert Haas wrote:
> Lazy VACUUM is the only thing that makes a page all visible.  I don't
> understand the part about snapshots.

Lazy VACUUM is the only thing that _marks_ a page with PD_ALL_VISIBLE. 

After an INSERT to a new page, and after all snapshots are released, the
page becomes all-visible; and thus subject to being marked with
PD_ALL_VISIBLE by lazy vacuum without bumping the LSN. Note that there
is no cleanup action that takes place here, so nothing else will bump
the LSN either.

So, let's say that we hypothetically had persistent snapshots, then
you'd have the following problem:

1. INSERT to a new page, marking it with LSN X
2. WAL flushed to LSN Y (Y > X)
3. Some persistent snapshot (that doesn't see the INSERT) is released,
and generates WAL recording that fact with LSN Z (Z > Y)
4. Lazy VACUUM marks the newly all-visible page with PD_ALL_VISIBLE
5. page is written out because LSN is still X
6. crash

Now, the persistent snapshot is still present because LSN Z never made
it to disk; but the page is marked with PD_ALL_VISIBLE.
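The failure sequence above can be replayed as a toy script; every name here is invented, since the "persistent snapshot" feature doesn't exist, and the numbers just stand in for the LSN ordering X < Y < Z:

```python
flushed_wal = 2           # WAL durable through LSN Y
page_lsn = 1              # LSN X: the INSERT
snapshot_release_lsn = 3  # LSN Z: logged, but not yet flushed

pd_all_visible = True     # VACUUM sets the flag; page LSN stays at X

# The bufmgr's WAL-before-data check passes (X <= Y), so the page
# is allowed to hit disk...
assert page_lsn <= flushed_wal

# ...yet after a crash, replay stops at Y: the snapshot-release record
# at Z is lost, the snapshot still "exists", and the on-disk page
# wrongly claims all-visible.
assert snapshot_release_lsn > flushed_wal and pd_all_visible
```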

Sure, if these hypothetical persistent snapshots were transactional, and
if synchronous_commit is on, then LSN Z would be flushed before step 3;
but that's another set of assumptions. That's why I left it simple and
said that the assumption was "snapshots are released if there's a
crash".

Regards,
Jeff Davis



Re: crash-safe visibility map, take five

From
Robert Haas
Date:
On Thu, Jun 23, 2011 at 6:40 PM, Jeff Davis <pgsql@j-davis.com> wrote:
> On Thu, 2011-06-23 at 18:18 -0400, Robert Haas wrote:
>> Lazy VACUUM is the only thing that makes a page all visible.  I don't
>> understand the part about snapshots.
>
> Lazy VACUUM is the only thing that _marks_ a page with PD_ALL_VISIBLE.
>
> After an INSERT to a new page, and after all snapshots are released, the
> page becomes all-visible; and thus subject to being marked with
> PD_ALL_VISIBLE by lazy vacuum without bumping the LSN. Note that there
> is no cleanup action that takes place here, so nothing else will bump
> the LSN either.
>
> So, let's say that we hypothetically had persistent snapshots, then
> you'd have the following problem:
>
> 1. INSERT to a new page, marking it with LSN X
> 2. WAL flushed to LSN Y (Y > X)
> 3. Some persistent snapshot (that doesn't see the INSERT) is released,
> and generates WAL recording that fact with LSN Z (Z > Y)
> 4. Lazy VACUUM marks the newly all-visible page with PD_ALL_VISIBLE
> 5. page is written out because LSN is still X
> 6. crash
>
> Now, the persistent snapshot is still present because LSN Z never made
> it to disk; but the page is marked with PD_ALL_VISIBLE.
>
> Sure, if these hypothetical persistent snapshots were transactional, and
> if synchronous_commit is on, then LSN Z would be flushed before step 3;
> but that's another set of assumptions. That's why I left it simple and
> said that the assumption was "snapshots are released if there's a
> crash".

I don't really think that's a separate set of assumptions - if we had
some system whereby snapshots could survive a crash, then they'd have
to be WAL-logged (because that's how we make things survive crashes).
And anything that is WAL-logged must obey the WAL-before-data rule.
We have a system already that ensures that when
synchronous_commit=off, CLOG pages can't be flushed before the
corresponding WAL record makes it to disk.  For a system like what
you're describing, you'd need something similar - these
crash-surviving snapshots would have to make sure that no action which
depended on their state hit the disk before the WAL record marking the
state change hit the disk.
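A toy version of that interlock, with invented names (the real mechanism tracks the highest dependent WAL position per CLOG page and calls XLogFlush before the page can be written):

```python
class Wal:
    def __init__(self):
        self.flushed = 0

    def flush(self, upto):
        self.flushed = max(self.flushed, upto)


def write_page(page_dep_lsn, wal):
    """Before the page goes to disk, make WAL durable up to the
    highest LSN the page's contents depend on."""
    if wal.flushed < page_dep_lsn:
        wal.flush(page_dep_lsn)  # WAL-before-data, enforced at write time
    return True                  # now safe to issue the write
```

A hypothetical crash-surviving snapshot would need the same shape of interlock: remember the LSN of the record that made the page all-visible, and refuse to let the page out until WAL is durable that far.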

I guess the point you are driving at here is that a page can only go
from being all-visible to not-all-visible by virtue of being modified.
There's no other piece of state (like a persistent snapshot) that can
be lost as part of a crash that would make us need to change our mind and
decide that an all-visible XID is really not all-visible after all.
(The reverse is not true: since snapshots are ephemeral, a crash will
render every row either all-visible or dead.)  I guess I never thought
about documenting that particular aspect of it because (to me) it
seems fairly self-evident.  Maybe I'm wrong...

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: crash-safe visibility map, take five

From
Jeff Davis
Date:
On Thu, 2011-06-23 at 22:02 -0400, Robert Haas wrote:
> > 1. INSERT to a new page, marking it with LSN X
> > 2. WAL flushed to LSN Y (Y > X)
> > 3. Some persistent snapshot (that doesn't see the INSERT) is released,
> > and generates WAL recording that fact with LSN Z (Z > Y)
> > 4. Lazy VACUUM marks the newly all-visible page with PD_ALL_VISIBLE
> > 5. page is written out because LSN is still X
> > 6. crash

> I don't really think that's a separate set of assumptions - if we had
> some system whereby snapshots could survive a crash, then they'd have
> to be WAL-logged (because that's how we make things survive crashes).

In the above example, it is WAL-logged (with LSN Z).

> And anything that is WAL-logged must obey the WAL-before-data rule.
> We have a system already that ensures that when
> synchronous_commit=off, CLOG pages can't be flushed before the
> corresponding WAL record makes it to disk.

In this case, how do you prevent the PD_ALL_VISIBLE from making it to
disk if you never bumped the LSN when it was set? It seems like you just
don't have the information to do so, and it seems like the information
required would be variable in size.

> I guess the point you are driving at here is that a page can only go
> from being all-visible to not-all-visible by virtue of being modified.
>  There's no other piece of state (like a persistent snapshot) that can
> be lost as part of a crash that would make us need to change our mind and
> decide that an all-visible XID is really not all-visible after all.
> (The reverse is not true: since snapshots are ephemeral, a crash will
> render every row either all-visible or dead.)  I guess I never thought
> about documenting that particular aspect of it because (to me) it
> seems fairly self-evident.  Maybe I'm wrong...

I didn't mean to make this conversation quite so hypothetical. My
primary points are:

1. Sometimes it makes sense to break the typical WAL conventions for
performance reasons. But when we do so, we have to be quite careful,
because things get complicated quickly.

2. PD_ALL_VISIBLE is a little bit more complex than other hint bits,
because the conditions under which it may be set are more complex
(having to do with both snapshots and cleanup actions). Other hint bits
are based only on transaction status: either the WAL for that
transaction's completion got flushed (and is therefore permanent), and we
set the hint bit; or it didn't get flushed and we don't.

Just having this discussion has been good enough for me to get a better
idea what's going on, so if you think the comments are sufficient that's
OK with me.

Regards,
Jeff Davis



Re: crash-safe visibility map, take five

From
Robert Haas
Date:
On Fri, Jun 24, 2011 at 1:43 PM, Jeff Davis <pgsql@j-davis.com> wrote:
>> And anything that is WAL-logged must obey the WAL-before-data rule.
>> We have a system already that ensures that when
>> synchronous_commit=off, CLOG pages can't be flushed before the
>> corresponding WAL record makes it to disk.
>
> In this case, how do you prevent the PD_ALL_VISIBLE from making it to
> disk if you never bumped the LSN when it was set? It seems like you just
> don't have the information to do so, and it seems like the information
> required would be variable in size.

Well, I think that would be a problem for the hypothetical implementer
of the persistent snapshot feature.  :-)

More seriously, Heikki and I previously discussed creating some
systematic method for suppressing FPIs when they are not needed,
perhaps by using a bit in the page header to indicate whether an FPI
has been generated since the last checkpoint.  I think it would be
nice to have such a system, but since we don't have a clear agreement
either that it's a good idea or what we'd do after that, I'm not
inclined to invest time in it.  To really get any benefit out of a
change in that area, we'd probably need to (a) remove the LSN
interlocks that prevent changes from being replayed if the LSN of the
page has already advanced beyond the record LSN and (b) change at
least some of XLOG_HEAP_{INSERT,UPDATE,DELETE} to be idempotent.  But
if we went in that direction then that might help to regularize some
of this and make it a bit less ad-hoc.
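The page-header-bit idea might look roughly like this toy sketch; `need_fpi`, `fpi_checkpoint_id`, and the checkpoint id are all invented for illustration (today's code instead compares the page LSN against the checkpoint's redo pointer):

```python
class Page:
    def __init__(self):
        self.fpi_checkpoint_id = None   # invented header field


def need_fpi(page, checkpoint_id):
    """Emit a full-page image only on the first touch per checkpoint
    cycle; later touches in the same cycle skip it."""
    if page.fpi_checkpoint_id == checkpoint_id:
        return False
    page.fpi_checkpoint_id = checkpoint_id
    return True
```

As noted above, cashing this in would also require relaxing the replay-time LSN interlock and making the affected record types idempotent, so the sketch only covers the bookkeeping half.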

> I didn't mean to make this conversation quite so hypothetical. My
> primary points are:
>
> 1. Sometimes it makes sense to break the typical WAL conventions for
> performance reasons. But when we do so, we have to be quite careful,
> because things get complicated quickly.

Yes.

> 2. PD_ALL_VISIBLE is a little bit more complex than other hint bits,
> because the conditions under which it may be set are more complex
> (having to do with both snapshots and cleanup actions). Other hint bits
> are based only on transaction status: either the WAL for that
> transaction completion got flushed (and is therefore permanent), and we
> set the hint bit; or it didn't get flushed and we don't.

I think the term "hint bits" really shouldn't be applied to anything
other than HEAP_{XMIN,XMAX}_{COMMITTED,INVALID}.  Otherwise, we get
into confusing territory pretty quickly.  Our algorithm for
opportunistically killing index entries pointing to dead tuples is not
WAL-logged, but it involves more than a single bit.  OTOH, clearing of
the PD_ALL_VISIBLE bit has always been WAL-logged, so lumping that in
with HEAP_XMIN_COMMITTED is pretty misleading.

> Just having this discussion has been good enough for me to get a better
> idea what's going on, so if you think the comments are sufficient that's
> OK with me.

I'm not 100% certain they are, but let's wait and see if anyone else
wants to weigh in...  please do understand I'm not trying to be a pain
in the neck.  :-)

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company