Thread: AIO writes vs hint bits vs checksums

AIO writes vs hint bits vs checksums

From: Andres Freund

Hi,

Currently we modify pages for hint bit writes while holding just a share lock.
Writing a buffer out also only requires a share lock. Because of that we can't
compute checksums and write out pages in-place, as a concurrent hint bit write
can easily corrupt the checksum.

That's not great, but not awful for our current synchronous buffered IO, as
we only ever have a single page being written out at a time.

However, it becomes a problem even if we just want to write out in chunks
larger than a single page - we'd need to reserve not just one BLCKSZ sized
buffer for this, but make it PG_IOV_MAX * BLCKSZ sized. Perhaps still
tolerable.

With AIO this becomes a considerably bigger issue:
a) We can't just have one write in progress at a time, but many
b) To be able to implement AIO using workers the "source" or "target" memory
   of the IO needs to be in shared memory

Besides that, the need to copy the buffers makes checkpoints with AIO
noticeably slower when checksums are enabled - it's not the checksum but the
copy that's the biggest source of the slowdown.


So far the AIO patchset has solved this by introducing a set of "bounce
buffers", which can be acquired and used as the source/target of IO when doing
it in-place into shared buffers isn't viable.

I am worried about that solution however, as either acquisition of bounce
buffers becomes a performance issue (that's how I did it at first, it was hard
to avoid regressions) or we reserve bounce buffers for each backend, in which
case the memory overhead for instances with relatively small amount of
shared_buffers and/or many connections can be significant.


Which led me down the path of trying to avoid the need for the copy in the
first place: what if we don't modify a page while it's undergoing IO?

The naive approach would be to not set hint bits with just a shared lock - but
that doesn't seem viable at all. For performance we rely on hint bits being
set and in many cases we'll only encounter the page in shared mode. We could
implement a conditional lock upgrade to an exclusive lock and do so while
setting hint bits, but that'd obviously be concerning from a concurrency point
of view.

What I suspect we might want instead is something in between a share and an
exclusive lock, which is taken while setting a hint bit and which conflicts
with having an IO in progress.

At first blush it might sound attractive to introduce this at the level of
lwlocks. However, I doubt that is a good idea - it'd make lwlock.c more
complicated which would imply overhead for other users, while the precise
semantics would be fairly specific to buffer locking.  A variant of this would
be to generalize lwlock.c to allow implementing different kinds of locks more
easily. But that's a significant project on its own and doesn't really seem
necessary for this specific project.

What I'd instead like to propose is to implement the right to set hint bits as
a bit in each buffer's state, similar to BM_IO_IN_PROGRESS. Tentatively I
named this BM_SETTING_HINTS. It's only allowed to set BM_SETTING_HINTS when
BM_IO_IN_PROGRESS isn't already set and StartBufferIO has to wait for
BM_SETTING_HINTS to be unset to start IO.
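Tentatively, the acquisition side could look roughly like the sketch below
(names are placeholders; waiting, resource owner accounting and the
buffer-header spinlock interaction are all elided):

static bool
BufferTryBeginSetHintBits(BufferDesc *buf)
{
    uint32      old_buf_state = pg_atomic_read_u32(&buf->state);

    for (;;)
    {
        /* a real version would also need to cope with BM_LOCKED here */
        if (old_buf_state & (BM_IO_IN_PROGRESS | BM_SETTING_HINTS))
            return false;       /* caller just skips setting the hint bit */

        if (pg_atomic_compare_exchange_u32(&buf->state, &old_buf_state,
                                           old_buf_state | BM_SETTING_HINTS))
            return true;
        /* old_buf_state was refreshed by the failed CAS, retry */
    }
}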

Naively implementing this, by acquiring and releasing the permission to set
hint bits in SetHintBits() unfortunately leads to a significant performance
regression. While the performance is unaffected for OLTPish workloads like
pgbench (both read and write), sequential scans of unhinted tables regress
significantly, due to the per-tuple lock acquisition this would imply.

But: We can address this and improve performance over the status quo! Today we
determine tuple visibility one-by-one, even when checking the
visibility of an entire page worth of tuples. That's not exactly free. I've
prototyped checking visibility of an entire page of tuples at once and it
indeed speeds up visibility checks substantially (in some cases seqscans are
over 20% faster!).

Once we have page-level visibility checks we can get the right to set hint
bits once for an entire page instead of doing it for every tuple - with that
in place the "new approach" of setting hint bits only with BM_SETTING_HINTS
wins.
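
Roughly, the page-mode path could then work like this sketch (all function
names and signatures here are placeholders for whatever the final interface
ends up being, with BufferTryBeginSetHintBits() from the sketch above):

static int
heap_page_visible_tuples(Buffer buffer, Snapshot snapshot,
                         HeapTupleData *tuples, int ntuples,
                         OffsetNumber *vistuples)
{
    int     nvis = 0;

    /* acquire the right to set hint bits once for the entire page */
    bool    can_hint = BufferTryBeginSetHintBits(buffer);

    for (int i = 0; i < ntuples; i++)
    {
        /* like HeapTupleSatisfiesMVCC(), but writes hint bits only if allowed */
        if (HeapTupleSatisfiesMVCCBatch(&tuples[i], snapshot, buffer, can_hint))
            vistuples[nvis++] = ItemPointerGetOffsetNumber(&tuples[i].t_self);
    }

    if (can_hint)
        BufferFinishSetHintBits(buffer);    /* clear flag, wake waiting IO */

    return nvis;
}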


Having a page level approach to setting hint bits has other advantages:

E.g. today, with wal_log_hints, we'll WAL-log hint bits on the first hint bit
set on the page, and on a hot standby we don't mark a page dirty for hint bit
writes at all. That will often result in hint bits not being persistently set
on replicas until the page is frozen.

Another issue is that we'll often WAL log hint bits for a page (due to hint
bits being set), just to then immediately log another WAL record for the same
page (e.g. for pruning), which is obviously wasteful. With a different
interface we could combine the WAL records for both.

I've not prototyped either, but I'm fairly confident they'd be helpful.


Does this sound like a reasonable idea?  Counterpoints?  If it does sound
reasonable, I'll clean up my pile of hacks into something
semi-understandable...


Greetings,

Andres Freund



Re: AIO writes vs hint bits vs checksums

From: Noah Misch

On Tue, Sep 24, 2024 at 11:55:08AM -0400, Andres Freund wrote:
> So far the AIO patchset has solved this by introducing a set of "bounce
> buffers", which can be acquired and used as the source/target of IO when doing
> it in-place into shared buffers isn't viable.
> 
> I am worried about that solution however, as either acquisition of bounce
> buffers becomes a performance issue (that's how I did it at first, it was hard
> to avoid regressions) or we reserve bounce buffers for each backend, in which
> case the memory overhead for instances with relatively small amount of
> shared_buffers and/or many connections can be significant.

> But: We can address this and improve performance over the status quo! Today we
> determine tuple visibility one-by-one, even when checking the
> visibility of an entire page worth of tuples. That's not exactly free. I've
> prototyped checking visibility of an entire page of tuples at once and it
> indeed speeds up visibility checks substantially (in some cases seqscans are
> over 20% faster!).

Nice!  It sounds like you refactored the relationship between
heap_prepare_pagescan() and HeapTupleSatisfiesVisibility() to move the hint
bit setting upward or the iterate-over-tuples downward.  Is that about right?

> Once we have page-level visibility checks we can get the right to set hint
> bits once for an entire page instead of doing it for every tuple - with that
> in place the "new approach" of setting hint bits only with BM_SETTING_HINTS
> wins.

How did page-level+BM_SETTING_HINTS performance compare to performance of the
page-level change w/o the BM_SETTING_HINTS change?

> Having a page level approach to setting hint bits has other advantages:
> 
> E.g. today, with wal_log_hints, we'll WAL-log hint bits on the first hint bit
> set on the page, and on a hot standby we don't mark a page dirty for hint bit
> writes at all. That will often result in hint bits not being persistently set
> on replicas until the page is frozen.

Nice way to improve that.

> Does this sound like a reasonable idea?  Counterpoints?

I guess the main part left to discuss is index scans or other scan types where
we'd either not do page-level visibility or we'd do page-level visibility
including tuples we wouldn't otherwise use.  BM_SETTING_HINTS likely won't
show up so readily in index scan profiles, but the cost is still there.  How
should we think about comparing the distributed cost of the buffer header
manipulations during index scans vs. the costs of bounce buffers?

Thanks,
nm



Re: AIO writes vs hint bits vs checksums

From: Andres Freund

Hi,

On 2024-09-24 12:43:40 -0700, Noah Misch wrote:
> On Tue, Sep 24, 2024 at 11:55:08AM -0400, Andres Freund wrote:
> > So far the AIO patchset has solved this by introducing a set of "bounce
> > buffers", which can be acquired and used as the source/target of IO when doing
> > it in-place into shared buffers isn't viable.
> >
> > I am worried about that solution however, as either acquisition of bounce
> > buffers becomes a performance issue (that's how I did it at first, it was hard
> > to avoid regressions) or we reserve bounce buffers for each backend, in which
> > case the memory overhead for instances with relatively small amount of
> > shared_buffers and/or many connections can be significant.
>
> > But: We can address this and improve performance over the status quo! Today we
> > determine tuple visibility one-by-one, even when checking the
> > visibility of an entire page worth of tuples. That's not exactly free. I've
> > prototyped checking visibility of an entire page of tuples at once and it
> > indeed speeds up visibility checks substantially (in some cases seqscans are
> > over 20% faster!).
>
> Nice!  It sounds like you refactored the relationship between
> heap_prepare_pagescan() and HeapTupleSatisfiesVisibility() to move the hint
> bit setting upward or the iterate-over-tuples downward.  Is that about right?

I've tried about five variations, so I don't have one answer to this yet :).

One problem is that having repeated loops doing PageGetItemId(),
PageGetItem(), ItemIdGetLength() isn't exactly free. To some degree it can be
hidden by allowing for better superscalar execution, but not entirely.

I've been somewhat confused by the compiler-generated code around ItemId
handling for a while - it looks way more expensive than it should, and it
regularly is a bottleneck due to the sheer number of instructions being
executed leading to being frontend bound. But I've never quite looked into it
deeply enough to figure out what's causing it / how to fix it.


> > Once we have page-level visibility checks we can get the right to set hint
> > bits once for an entire page instead of doing it for every tuple - with that
> > in place the "new approach" of setting hint bits only with BM_SETTING_HINTS
> > wins.
>
> How did page-level+BM_SETTING_HINTS performance compare to performance of the
> page-level change w/o the BM_SETTING_HINTS change?

Just ran that. There probably is a performance difference, but it's small
(<0.5%), making it somewhat hard to be certain. It looks like the main reason
for that is that ConditionVariableBroadcast() on the iocv shows up even though
nobody is waiting.

I've been fighting that with AIO as well, so maybe it's time to figure out the
memory ordering rules that'd allow checking that without a full spinlock
acquisition.
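
I.e. something like the sketch below at the top of
ConditionVariableBroadcast() - figuring out whether the barrier pairing
actually holds is exactly the open part:

    /* hypothetical fast path, before taking the spinlock */
    pg_memory_barrier();    /* pair with barrier in ConditionVariablePrepareToSleep() */
    if (proclist_is_empty(&cv->wakeup))
        return;             /* nobody enqueued themselves before the broadcast */
    /* otherwise fall through to the existing spinlock-protected wakeup */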

If we figure it out, we perhaps should use the chance to get rid of
BM_PIN_COUNT_WAITER...


> > Does this sound like a reasonable idea?  Counterpoints?
>
> I guess the main part left to discuss is index scans or other scan types where
> we'd either not do page-level visibility or we'd do page-level visibility
> including tuples we wouldn't otherwise use.  BM_SETTING_HINTS likely won't
> show up so readily in index scan profiles, but the cost is still there.

I could indeed not make it show up in some simple index lookup heavy
workloads.  I need to try some more extreme cases though (e.g. fetching all
tuples in a table via an index or having very long HOT chains).

If it's not visible cost-wise compared to all the other costs of index scans -
does it matter?  If it's not visible it's either because it proportionally is
very small or because it's completely hidden by superscalar execution.


Another thing I forgot to mention that probably fits into the "tradeoffs"
bucket: Because BM_SETTING_HINTS would be just a single bit, one backend
setting hint bits would block out another backend setting hint bits. In most
situations that'll be fine, or even faster than not doing so due to reducing
cache line ping-pong, but in cases of multiple backends doing index lookups to
different unhinted tuples on the same page it could be a bit worse.

But I suspect that's fine because it's hard to believe that you could have
enough index lookups to unhinted tuples for that to be a bottleneck -
something has to produce all those unhinted tuples after all, and that's
rather more expensive. And for single-tuple visibility checks the window in
which hint bits are set is very small.


> How should we think about comparing the distributed cost of the buffer
> header manipulations during index scans vs. the costs of bounce buffers?

Well, the cost of bounce buffers would be borne as long as postgres is up,
whereas a not-measurable (if it indeed isn't) cost during index scans wouldn't
really show up. If there are cases where the cost of the more expensive hint
bit logic does show up, it'll get a lot harder to weigh.


So far my prototype always uses the path that avoids setting hint bits while
IO is going on, not just when checksums are enabled. We could change that, I
guess. However, our habit of modifying buffers while IO is going on is causing
issues with filesystem level checksums as well, as evidenced by the fact that
debug_io_direct = data on btrfs causes filesystem corruption. So I tend to
think it'd be better to just stop doing that altogether (we also do that for
WAL, when writing out a partial page, but a potential fix there would be
different, I think).



A thing I forgot to mention: Bounce buffers are kind of an architectural dead
end, in that we wouldn't need them in that form if we get to a threaded
postgres.


Zooming out (a lot) more: I like the idea of having a way to get the
permission to perform some kinds of modifications on a page without an
exclusive lock. While obviously a lot more work, I do think there's some
potential to have some fast-paths that perform work on a page level without
blocking out readers. E.g. some simple cases of insert could correctly be done
without blocking out readers (by ordering the update of the max page offset
after the tuple contents have been written).


Greetings,

Andres Freund



Re: AIO writes vs hint bits vs checksums

From: Thomas Munro

On Wed, Sep 25, 2024 at 8:30 AM Andres Freund <andres@anarazel.de> wrote:
> Just ran that. There probably is a performance difference, but it's small
> (<0.5%), making it somewhat hard to be certain. It looks like the main reason
> for that is that ConditionVariableBroadcast() on the iocv shows up even though
> nobody is waiting.

. o O { Gotta fix that.  Memory barriers might be enough to check for an
empty wait list? And even in the slow path, atomic wait lists or
something better than spinlocks... }

> However, our habit of modifying buffers while IO is going on is
> causing issues with filesystem level checksums as well, as evidenced by the
> fact that debug_io_direct = data on btrfs causes filesystem corruption. So I
> tend to think it'd be better to just stop doing that altogether (we also do
> that for WAL, when writing out a partial page, but a potential fix there would
> be different, I think).

+many.  Interesting point re the WAL variant.  For the record, here's
some discussion and a repro for that problem, which Andrew currently
works around in a build farm animal with mount options:

https://www.postgresql.org/message-id/CA%2BhUKGKSBaz78Fw3WTF3Q8ArqKCz1GgsTfRFiDPbu-j9OFz-jw%40mail.gmail.com



Re: AIO writes vs hint bits vs checksums

From: Noah Misch

On Tue, Sep 24, 2024 at 04:30:25PM -0400, Andres Freund wrote:
> On 2024-09-24 12:43:40 -0700, Noah Misch wrote:
> > On Tue, Sep 24, 2024 at 11:55:08AM -0400, Andres Freund wrote:
> > > Besides that, the need to copy the buffers makes checkpoints with AIO
> > > noticeably slower when checksums are enabled - it's not the checksum but the
> > > copy that's the biggest source of the slowdown.

How big is that copy's contribution to the slowdown there?  A measurable CPU
overhead on writes likely does outweigh the unmeasurable overhead on index
scans, but ...

> > > Does this sound like a reasonable idea?  Counterpoints?

> > How should we think about comparing the distributed cost of the buffer
> > header manipulations during index scans vs. the costs of bounce buffers?
> 
> Well, the cost of bounce buffers would be borne as long as postgres is up,
> whereas a not-measurable (if it indeed isn't) cost during index scans wouldn't
> really show up.

... neither BM_SETTING_HINTS nor keeping bounce buffers looks like a bad
decision.  From what I've heard so far of the performance effects, if it were
me, I would keep the bounce buffers.  I'd pursue BM_SETTING_HINTS and bounce
buffer removal as a distinct project after the main AIO capability.  Bounce
buffers have an implementation.  They aren't harming other design decisions.
The AIO project is big, so I'd want to err on the side of not designating
other projects as its prerequisites.

> Zooming out (a lot) more: I like the idea of having a way to get the
> permission to perform some kinds of modifications on a page without an
> exclusive lock. While obviously a lot more work, I do think there's some
> potential to have some fast-paths that perform work on a page level without
> blocking out readers. E.g. some simple cases of insert could correctly be done
> without blocking out readers (by ordering the update of the max page offset
> after the tuple contents have been written).

True.



Re: AIO writes vs hint bits vs checksums

From: Antonin Houska

Andres Freund <andres@anarazel.de> wrote:

> What I'd instead like to propose is to implement the right to set hint bits as
> a bit in each buffer's state, similar to BM_IO_IN_PROGRESS. Tentatively I
> named this BM_SETTING_HINTS. It's only allowed to set BM_SETTING_HINTS when
> BM_IO_IN_PROGRESS isn't already set and StartBufferIO has to wait for
> BM_SETTING_HINTS to be unset to start IO.
>
> Naively implementing this, by acquiring and releasing the permission to set
> hint bits in SetHintBits() unfortunately leads to a significant performance
> regression. While the performance is unaffected for OLTPish workloads like
> pgbench (both read and write), sequential scans of unhinted tables regress
> significantly, due to the per-tuple lock acquisition this would imply.

An alternative approach: introduce a flag that tells that the checksum is
being computed, and disallow setting hint bits when that flag is set. As long
as the checksum computation takes much less time than the IO, fewer hint
bit updates should be rejected.

Of course, SetHintBits() would have to update the checksum too. But if it
could determine where the hint bit is located in the buffer and if "some
intermediate state" of the computation was maintained for each page in shared
buffers, then the checksum update might be cheaper than the initial
computation. But I'm not sure I understand the algorithm enough.

--
Antonin Houska
Web: https://www.cybertec-postgresql.com



Re: AIO writes vs hint bits vs checksums

From: Antonin Houska

Antonin Houska <ah@cybertec.at> wrote:

> Andres Freund <andres@anarazel.de> wrote:
>
> > What I'd instead like to propose is to implement the right to set hint bits as
> > a bit in each buffer's state, similar to BM_IO_IN_PROGRESS. Tentatively I
> > named this BM_SETTING_HINTS. It's only allowed to set BM_SETTING_HINTS when
> > BM_IO_IN_PROGRESS isn't already set and StartBufferIO has to wait for
> > BM_SETTING_HINTS to be unset to start IO.
> >
> > Naively implementing this, by acquiring and releasing the permission to set
> > hint bits in SetHintBits() unfortunately leads to a significant performance
> > regression. While the performance is unaffected for OLTPish workloads like
> > pgbench (both read and write), sequential scans of unhinted tables regress
> > significantly, due to the per-tuple lock acquisition this would imply.
>
> An alternative approach: introduce a flag that tells that the checksum is
> being computed, and disallow setting hint bits when that flag is set. As long
> as the checksum computation takes much less time than the IO, fewer hint
> bit updates should be rejected.

Well, the checksum actually has to be computed before the IO starts and stay
valid until it completes, so the whole IO would still disallow hint bit
updates :-(

--
Antonin Houska
Web: https://www.cybertec-postgresql.com



Re: AIO writes vs hint bits vs checksums

From: Thomas Munro

On Wed, Sep 25, 2024 at 12:45 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> On Wed, Sep 25, 2024 at 8:30 AM Andres Freund <andres@anarazel.de> wrote:
> > However, our habit of modifying buffers while IO is going on is
> > causing issues with filesystem level checksums as well, as evidenced by the
> > fact that debug_io_direct = data on btrfs causes filesystem corruption. So I
> > tend to think it'd be better to just stop doing that altogether (we also do
> > that for WAL, when writing out a partial page, but a potential fix there would
> > be different, I think).
>
> +many.  Interesting point re the WAL variant.  For the record, here's
> some discussion and a repro for that problem, which Andrew currently
> works around in a build farm animal with mount options:
>
> https://www.postgresql.org/message-id/CA%2BhUKGKSBaz78Fw3WTF3Q8ArqKCz1GgsTfRFiDPbu-j9OFz-jw%40mail.gmail.com

Here's an interesting new development in that area, this time from
OpenZFS, which committed its long awaited O_DIRECT support a couple of
weeks ago[1] and seems to have taken a different direction since that
last discussion.  Clearly it has the same checksum stability problem
as BTRFS and PostgreSQL itself, so an O_DIRECT mode with the goal of
avoiding copying and caching must confront that and break *something*,
or accept something like bounce buffers and give up the zero-copy
goal.  Curiously, they seem to have landed on two different solutions
with three different possible behaviours: (1) On FreeBSD, temporarily
make the memory non-writeable, (2) On Linux, they couldn't do that so
they have an extra checksum verification on write.  I haven't fully
grokked all this yet, or even tried it, and it's not released or
anything, but it looks a bit like all three behaviours are bad for our
current hint bit design: on FreeBSD, setting a hint bit might crash
(?) if a write is in progress in another process, and on Linux,
depending on zfs_vdev_direct_write_verify, either the concurrent write
might fail (= checkpointer failing on EIO because someone concurrently
set a hint bit) or a later read might fail (= file is permanently
corrupted and you don't find out until later, like btrfs).  I plan to
look more closely soon and see if I understood that right...

[1] https://github.com/openzfs/zfs/pull/10018/commits/d7b861e7cfaea867ae28ab46ab11fba89a5a1fda



Re: AIO writes vs hint bits vs checksums

From: Heikki Linnakangas

On 30/10/2024 04:21, Andres Freund wrote:
> Attached is a, unfortunately long, series of patches implementing what I
> described upthread.

Review of the preparatory patches:

> 0001 Add very basic test for kill_prior_tuples
> 
>      We currently don't exercise this patch for gist and hash, which seems
>      somewhat criminal.

Interesting to use the isolationtester for this. There's just one 
session, so you're just using it to define reusable steps with handy 
names. I'm fine with that, but please add a comment to explain it. I 
wonder if it'd be more straightforward to make it a regular pg_regress 
test though. There would be some repetition, but would it be so bad?

You forgot to add the new test to 'isolation_schedule'.

typos:
  "inex" -> "index"
  "does something approximately reasonble" -> "do something 
approximately reasonable"

> 0002 heapam: Move logic to handle HEAP_MOVED into a helper function
> 
>      A prep patch, which consolidates HEAP_MOVED handling into a helper. This
>      isn't strictly necessary, but I got bothered by it while writing this patch
>      series.

+1

> +/*
> + * If HEAP_MOVED_OFF or HEAP_MOVED_IN are set on the tuple, remove them and
> + * adjust hint bits. See the comment for SetHintBits() for more background.
> + *
> + * This helper returns false if the row ought to be invisible, true otherwise.
> + */
> +static inline bool
> +HeapTupleCleanMoved(HeapTupleHeader tuple, Buffer buffer)
> +{
> +    TransactionId xvac;
> +
> +    /* only used by pre-9.0 binary upgrades */
> +    if (likely(!(tuple->t_infomask & (HEAP_MOVED_OFF | HEAP_MOVED_IN))))
> +        return true;
> +
> +     ...
> +}

This is so unlikely these days that this function probably should not be 
inlined anymore.

> 0003 bufmgr: Add BufferLockHeldByMe()
> 
>      Useful for writing assertions in later patches.

> +    if (mode == BUFFER_LOCK_EXCLUSIVE)
> +        lwmode = LW_EXCLUSIVE;
> +    else if (mode == BUFFER_LOCK_SHARE)
> +        lwmode = LW_SHARED;
> +    else
> +    {
> +        Assert(false);
> +        pg_unreachable();
> +        lwmode = LW_EXCLUSIVE;  /* assuage compiler */
> +    }

I just learned a new word :-).

This is fine, but to avoid the assuaging, you could also do this:

if (mode == BUFFER_LOCK_EXCLUSIVE)
     lwmode = LW_EXCLUSIVE;
else
{
     Assert(mode == BUFFER_LOCK_SHARE);
     lwmode = LW_SHARED;
}

> 0004 heapam: Use exclusive lock on old page in CLUSTER
> 
>      When I started looking into this I wanted to make sure that it's actually
>      safe to skip setting hint bits in all places. One place that's *not* safe
>      is cluster - we assume that all hint bits are set in rewrite_heap_tuple().
> 
>      This might not strictly be required, but it seemed very fragile as
>      is. While using an exclusive lock does seem like it makes it a bit safer,
>      it's not a very good fix, as we still dirty the old buffer, which seems
>      like a shame when one uses VACUUM FULL.

> +        /*
> +         * To be able to guarantee that we can set the hint bit, acquire an
> +         * exclusive lock on the old buffer. We need the hint bits to be set
> +         * as otherwise reform_and_rewrite_tuple() -> rewrite_heap_tuple() ->
> +         * heap_freeze_tuple() will get confused.
> +         *

Hmm, I don't see any comments in the reform_and_rewrite_tuple() -> 
rewrite_heap_tuple() -> heap_freeze_tuple() path about requiring hint 
bits to be set. How does it get confused?

> 0005 heapam: Only set tuple's block once per page in pagemode
> 
>      Minor, but measurable, optimization.

+1, this makes sense independently of all the other patches.

> diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
> index 1748eafa100..d0cb4b1e29b 100644
> --- a/src/backend/access/heap/heapam.c
> +++ b/src/backend/access/heap/heapam.c
> @@ -981,6 +981,9 @@ heapgettup_pagemode(HeapScanDesc scan,
>          linesleft = scan->rs_ntuples;
>          lineindex = ScanDirectionIsForward(dir) ? 0 : linesleft - 1;
>  
> +        /* set block once per page, instead of doing so for every tuple */
> +        BlockIdSet(&tuple->t_self.ip_blkid, scan->rs_cblock);
> +
>          /* lineindex now references the next or previous visible tid */
>  continue_page:
>  
> @@ -995,7 +998,7 @@ continue_page:
>  
>              tuple->t_data = (HeapTupleHeader) PageGetItem(page, lpp);
>              tuple->t_len = ItemIdGetLength(lpp);
> -            ItemPointerSet(&(tuple->t_self), scan->rs_cblock, lineoff);
> +            tuple->t_self.ip_posid = lineoff;
>  
>              /* skip any tuples that don't match the scan key */
>              if (key != NULL &&

Should probably use ItemPointerSetBlockNumber and 
ItemPointerSetOffsetNumber.

> 0006 heapam: Add batch mode mvcc check and use it in page mode
> 
>      This is a good bit faster on its own, but is required to avoid a
>      performance regression later, when setting hint bits only when no IO is
>      going on at the same time.

The BATCHMVCC_FEWER_ARGS stuff is pretty weird. Hard to imagine that 
it'd make any difference how you pass those arguments.

> +        if (likely(valid))
> +        {
> +            vistuples_dense[nvis] = tup->t_self.ip_posid;
> +            nvis++;
> +        }

Is the likely(valid) hint important for performance here? It would be a 
bad hint for a table with lots of dead tuples. I feel we'd better rely 
on the branch predictor for this one.

> 0007 bufmgr: Make it easier to change number of buffer state bits
> 
>      This just makes it easier to change the division of bits in
>      BufferDesc->state.

Looks good to me.

-- 
Heikki Linnakangas
Neon (https://neon.tech)




Re: AIO writes vs hint bits vs checksums

From: Heikki Linnakangas

On 24/09/2024 18:55, Andres Freund wrote:
> What I'd instead like to propose is to implement the right to set hint bits as
> a bit in each buffer's state, similar to BM_IO_IN_PROGRESS. Tentatively I
> named this BM_SETTING_HINTS. It's only allowed to set BM_SETTING_HINTS when
> BM_IO_IN_PROGRESS isn't already set and StartBufferIO has to wait for
> BM_SETTING_HINTS to be unset to start IO.
> 
> Naively implementing this, by acquiring and releasing the permission to set
> hint bits in SetHintBits() unfortunately leads to a significant performance
> regression. While the performance is unaffected for OLTPish workloads like
> pgbench (both read and write), sequential scans of unhinted tables regress
> significantly, due to the per-tuple lock acquisition this would imply.

It'd be nice to implement this without having to set and clear 
BM_SETTING_HINTS every time. Could we put the overhead on the 
FlushBuffer() instead?

How about something like this:

To set hint bits:

1. Lock page in SHARED mode.
2. Read BM_IO_IN_PROGRESS
3. If !BM_IO_IN_PROGRESS, set hint bits, otherwise don't
4. Unlock page

To flush a buffer:

1. Lock page in SHARED mode
2. Set BM_IO_IN_PROGRESS
3. Read the lock count on the buffer lock, to see if we're the only locker.
4. If anyone else is holding the lock, upgrade it to exclusive mode, and 
immediately downgrade back to share mode.
5. calculate CRC, flush the buffer
6. Clear BM_IO_IN_PROGRESS and unlock page.

This goes back to the idea of adding LWLock support for this, but the 
amount of changes could be pretty small. The missing operation we don't 
have today is reading the share-lock count on the lock in step 3. That 
seems simple to add.
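
Roughly like this, with a hypothetical LWLockShareLockCount() providing that 
missing operation:

LWLock *content_lock = BufferDescriptorGetContentLock(buf);

LWLockAcquire(content_lock, LW_SHARED);
/* set BM_IO_IN_PROGRESS under the buffer header lock, as today */

if (LWLockShareLockCount(content_lock) > 1)
{
    /*
     * Wait out the pre-existing share lockers. New share lockers are
     * harmless: they see BM_IO_IN_PROGRESS and won't set hint bits. (There
     * is no real lock upgrade, so this is a release and reacquire.)
     */
    LWLockRelease(content_lock);
    LWLockAcquire(content_lock, LW_EXCLUSIVE);
    LWLockRelease(content_lock);
    LWLockAcquire(content_lock, LW_SHARED);
}

/* the page is now stable: checksum it and issue the write */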

Acquiring the exclusive lock in step 4 is just a way to wait for all the 
existing share-lockers to release the lock. You wouldn't need to block 
new share-lockers. We have LW_WAIT_UNTIL_FREE, which is almost what we 
need, but it currently ignores share-lockers. So doing this "properly" 
would require more changes to LWLocks. Briefly acquiring an exclusive 
lock seems acceptable though.


> But: We can address this and improve performance over the status quo! Today we
> determine tuple visibility one-by-one, even when checking the
> visibility of an entire page worth of tuples.

Yes, batching the visibility checks makes sense in any case.

-- 
Heikki Linnakangas
Neon (https://neon.tech)




Re: AIO writes vs hint bits vs checksums

From: Andres Freund

Hi,

Thanks for the quick review!


On 2024-10-30 14:16:51 +0200, Heikki Linnakangas wrote:
> On 24/09/2024 18:55, Andres Freund wrote:
> > What I'd instead like to propose is to implement the right to set hint bits as
> > a bit in each buffer's state, similar to BM_IO_IN_PROGRESS. Tentatively I
> > named this BM_SETTING_HINTS. It's only allowed to set BM_SETTING_HINTS when
> > BM_IO_IN_PROGRESS isn't already set and StartBufferIO has to wait for
> > BM_SETTING_HINTS to be unset to start IO.
> >
> > Naively implementing this, by acquiring and releasing the permission to set
> > hint bits in SetHintBits() unfortunately leads to a significant performance
> > regression. While the performance is unaffected for OLTPish workloads like
> > pgbench (both read and write), sequential scans of unhinted tables regress
> > significantly, due to the per-tuple lock acquisition this would imply.
>
> It'd be nice to implement this without having to set and clear
> BM_SETTING_HINTS every time.

True - but it's worth noting that we only need to set BM_SETTING_HINTS if
there are any hint bits that need to be set. So in the most common paths we
won't need to deal with it at all.



> Could we put the overhead on the FlushBuffer()
> instead?
>
> How about something like this:
>
> To set hint bits:
>
> 1. Lock page in SHARED mode.
> 2. Read BM_IO_IN_PROGRESS
> 3. If !BM_IO_IN_PROGRESS, set hint bits, otherwise don't
> 4. Unlock page
>
> To flush a buffer:
>
> 1. Lock page in SHARED mode
> 2. Set BM_IO_IN_PROGRESS
> 3. Read the lock count on the buffer lock, to see if we're the only locker.
> 4. If anyone else is holding the lock, upgrade it to exclusive mode, and
> immediately downgrade back to share mode.
> 5. calculate CRC, flush the buffer
> 6. Clear BM_IO_IN_PROGRESS and unlock page.
>
> This goes back to the idea of adding LWLock support for this, but the amount
> of changes could be pretty small. The missing operation we don't have today
> is reading the share-lock count on the lock in step 3. That seems simple to
> add.

I've played around with a bunch of ideas like this. There are two main
reasons I didn't like them that much in the end:

1) The worst case latency impacts seemed to make them not that
   interesting. A buffer that is heavily contended with share locks might not
   get down to zero share lockers for quite a while. That's not a problem for
   individual FlushBuffer() calls, but it could very well add up to a decent
   sized delay for something like a checkpoint that has to flush a lot of
   buffers.

   Waiting for all pre-existing share lockers is easier said than done. We
   don't record the acquisition order anywhere and a share-lock release won't
   wake anybody if the lockcount doesn't reach zero. Changing that wouldn't
   exactly be free and the cost would be borne by all lwlock users.

2) They are based on holding an lwlock. But it's actually quite conceivable
   that we'd want to set something hint-bit-like without any lock, as we
   e.g. currently do for freespace/. That would work with something like the
   approach I chose here, but not if we rely on lwlocks.

   E.g. with a bit of work we could actually do sequential scans without
   blocking concurrent buffer modifications by carefully ordering when
   pd_lower is modified and making sure that tuple header fields are written
   in a sensible order.



> Acquiring the exclusive lock in step 4 is just a way to wait for all the
> existing share-lockers to release the lock. You wouldn't need to block new
> share-lockers. We have LW_WAIT_UNTIL_FREE, which is almost what we need, but
> it currently ignores share-lockers. So doing this "properly" would require
> more changes to LWLocks. Briefly acquiring an exclusive lock seems
> acceptable though.

I don't think LW_WAIT_UNTIL_FREE is that easily extended to cover something
like this, because share lockers can acquire/release the lock without waking
anybody. And changing that would be rather expensive for everyone...


> > But: We can address this and improve performance over the status quo! Today we
> > determine tuple visibility one-by-one, even when checking the
> > visibility of an entire page worth of tuples.
>
> Yes, batching the visibility checks makes sense in any case.

Cool. The wins by doing that are noticeable...

Greetings,

Andres Freund



Re: AIO writes vs hint bits vs checksums

From: Andres Freund

Hi,

On 2024-10-30 13:29:01 +0200, Heikki Linnakangas wrote:
> On 30/10/2024 04:21, Andres Freund wrote:
> > Attached is a, unfortunately long, series of patches implementing what I
> > described upthread.
> 
> Review of the preparatory patches:
> 
> > 0001 Add very basic test for kill_prior_tuples
> > 
> >      We currently don't exercise this patch for gist and hash, which seems
> >      somewhat criminal.
> 
> Interesting to use the isolationtester for this. There's just one session,
> so you're just using it to define reusable steps with handy names.

Yea. I had started out writing it as a pg_regress style test and it quickly got very
hard to understand.


> I'm fine with that, but please add a comment to explain it.

Makes sense.


> I wonder if it'd be more straightforward to make it a regular pg_regress
> test though. There would be some repetition, but would it be so bad?

I found it to be quite bad. If testing just one AM it's ok-ish, but once you
test 2-3 it gets very long and repetitive. I guess we could use functions or
such to make it a bit less painful - but at that point, is it actually simpler?


> You forgot to add the new test to 'isolation_schedule'.

Oops.

> typos:
>  "inex" -> "index"
>  "does something approximately reasonble" -> "do something approximately
> reasonable"

Oops^2.


> > +/*
> > + * If HEAP_MOVED_OFF or HEAP_MOVED_IN are set on the tuple, remove them and
> > + * adjust hint bits. See the comment for SetHintBits() for more background.
> > + *
> > + * This helper returns false if the row ought to be invisible, true otherwise.
> > + */
> > +static inline bool
> > +HeapTupleCleanMoved(HeapTupleHeader tuple, Buffer buffer)
> > +{
> > +    TransactionId xvac;
> > +
> > +    /* only used by pre-9.0 binary upgrades */
> > +    if (likely(!(tuple->t_infomask & (HEAP_MOVED_OFF | HEAP_MOVED_IN))))
> > +        return true;
> > +
> > +     ...
> > +}
> 
> This is so unlikely these days that this function probably should not be
> inlined anymore.

Because I put the check for HEAP_MOVED inside the function, it is important
that it's inlined. The compiler can understand that the rest of the function
is unlikely to be executed and group it with other rarely-executed code.


> > 0003 bufmgr: Add BufferLockHeldByMe()
> > 
> >      Useful for writing assertions in later patches.
> 
> > +    if (mode == BUFFER_LOCK_EXCLUSIVE)
> > +        lwmode = LW_EXCLUSIVE;
> > +    else if (mode == BUFFER_LOCK_SHARE)
> > +        lwmode = LW_SHARED;
> > +    else
> > +    {
> > +        Assert(false);
> > +        pg_unreachable();
> > +        lwmode = LW_EXCLUSIVE;  /* assuage compiler */
> > +    }
> 
> I just learned a new word :-).

Heh.


> This is fine, but to avoid the assuaging, you could also do this:
> 
> if (mode == BUFFER_LOCK_EXCLUSIVE)
>     lwmode = LW_EXCLUSIVE;
> else
> {
>     Assert(mode == BUFFER_LOCK_SHARE);
>     lwmode = LW_SHARED;
> }

I have no opinion about which version is better...


> > 0004 heapam: Use exclusive lock on old page in CLUSTER
> > 
> >      When I started looking into this I wanted to make sure that it's actually
> >      safe to skip setting hint bits in all places. One place that's *not* safe
> >      is cluster - we assume that all hint bits are set in rewrite_heap_tuple().
> > 
> >      This might not strictly be required, but it seemed very fragile as
> >      is. While using an exclusive lock does seem like it makes it a bit safer,
> >      it's not a very good fix, as we still dirty the old buffer, which seems
> >      like a shame when one uses VACUUM FULL.
> 
> > +        /*
> > +         * To be able to guarantee that we can set the hint bit, acquire an
> > +         * exclusive lock on the old buffer. We need the hint bits to be set
> > +         * as otherwise reform_and_rewrite_tuple() -> rewrite_heap_tuple() ->
> > +         * heap_freeze_tuple() will get confused.
> > +         *
> 
> Hmm, I don't see any comments in the reform_and_rewrite_tuple() ->
> rewrite_heap_tuple() -> heap_freeze_tuple() path about requiring hint bits
> to be set.

There's this comment is in heap_prepare_freeze_tuple():

 * It is assumed that the caller has checked the tuple with
 * HeapTupleSatisfiesVacuum() and determined that it is not HEAPTUPLE_DEAD
 * (else we should be removing the tuple, not freezing it).

I don't remember the precise details of what all goes wrong - I think there
was more than one avenue.

One example is stuff like this:
    /*
     * If the tuple has been updated, check the old-to-new mapping hash table.
     */
    if (!((old_tuple->t_data->t_infomask & HEAP_XMAX_INVALID) ||
          HeapTupleHeaderIsOnlyLocked(old_tuple->t_data)) &&
        !HeapTupleHeaderIndicatesMovedPartitions(old_tuple->t_data) &&
        !(ItemPointerEquals(&(old_tuple->t_self),
                            &(old_tuple->t_data->t_ctid))))

if the original tuple had an xmax but we didn't unset it in
HeapTupleSatisfiesVacuum(), due to not being allowed to write a hint bit,
we'll behave differently here. But at the same time
heapam_relation_copy_for_cluster() will treat the tuple as dead and call
rewrite_heap_dead_tuple().


> How does it get confused?

Corrupted visibility information after a CLUSTER.



> > 0006 heapam: Add batch mode mvcc check and use it in page mode
> > 
> >      This is a good bit faster on its own, but is required to avoid a
> >      performance regression later, when setting hint bits only when no IO is
> >      going on at the same time.
> 
> The BATCHMVCC_FEWER_ARGS stuff is pretty weird. Hard to imagine that it'd
> make any difference how you pass those arguments.

I unfortunately found it to make a difference :(, even after trying to account
for binary layout differences by randomizing section order via linker flags.


> > +        if (likely(valid))
> > +        {
> > +            vistuples_dense[nvis] = tup->t_self.ip_posid;
> > +            nvis++;
> > +        }
> 
> Is the likely(valid) hint important for performance here? It would be a bad
> hint for a table with lots of dead tuples. I feel we'd better rely on the
> branch predictor for this one.

It did :(

Greetings,

Andres Freund



Re: AIO writes vs hint bits vs checksums

From: Jeff Davis

On Tue, 2024-09-24 at 11:55 -0400, Andres Freund wrote:
> What I suspect we might want instead is something in between a share and an
> exclusive lock, which is taken while setting a hint bit and which conflicts
> with having an IO in progress.

I am starting to wonder if shared content locks are really the right
concept at all. It makes sense for simple mutexes, but we are already
more complex than that, and you are suggesting adding on to that
complexity. Which I agree is a good idea, I'm just wondering if we
could go even further.

The README states that a shared lock is necessary for visibility
checking, but can we just be more careful with the ordering and
atomicity of visibility changes in the page?

 * carefully order reads and writes of xmin/xmax/hints (would
   that create too many read barriers in the tqual.c code?)
 * write line pointer after tuple is written
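
For the second point I'm thinking of something like this on the insert side
(just a sketch, with approximate variable names; readers would need a
matching read barrier before dereferencing the line pointer):

    /* copy the tuple contents into place first */
    memcpy((char *) page + upper, heaptup->t_data, heaptup->t_len);

    /* make the contents visible before the line pointer is */
    pg_write_barrier();

    /* only now "publish" the tuple by filling in its line pointer */
    ItemIdSetNormal(itemId, upper, heaptup->t_len);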

We would still have pins and cleanup locks to prevent data removal.
We'd have the logic you suggest that would prevent modification during
IO. And there would still need to be an exclusive content locks so that
two inserts don't try to allocate the same line pointer, or lock the
same tuple.

If PD_ALL_VISIBLE is set it's even simpler.

Am I missing some major hazards?

Regards,
    Jeff Davis




Re: AIO writes vs hint bits vs checksums

From: Andres Freund

Hi,

On 2024-10-30 10:47:35 -0700, Jeff Davis wrote:
> On Tue, 2024-09-24 at 11:55 -0400, Andres Freund wrote:
> > What I suspect we might want instead is something in between a share and an
> > exclusive lock, which is taken while setting a hint bit and which conflicts
> > with having an IO in progress.
> 
> I am starting to wonder if shared content locks are really the right
> concept at all. It makes sense for simple mutexes, but we are already
> more complex than that, and you are suggesting adding on to that
> complexity.

What I am proposing isn't making the content lock more complicated; it's
orthogonal to the content lock.


> Which I agree is a good idea, I'm just wondering if we could go even
> further.
> 
> The README states that a shared lock is necessary for visibility
> checking, but can we just be more careful with the ordering and
> atomicity of visibility changes in the page?
> 
>  * carefully order reads and writes of xmin/xmax/hints (would
>    that create too many read barriers in the tqual.c code?)
>  * write line pointer after tuple is written

It's possible, but it'd be a lot of work.  And you wouldn't need to do this
just for heap, but for all the indexes too, to make progress on the
don't-set-hint-bits-during-io front. So I don't think it makes sense to tie
these things together.

I do think that it's an argument for not importing all the complexity into
lwlock.c though.


> We would still have pins and cleanup locks to prevent data removal.

As-is cleanup locks only work in coordination with content locks. While
cleanup is ongoing we need to prevent anybody from starting to look at the
page - without acquiring something like a shared lock that's not easy.


> We'd have the logic you suggest that would prevent modification during
> IO. And there would still need to be an exclusive content locks so that
> two inserts don't try to allocate the same line pointer, or lock the
> same tuple.
> 
> If PD_ALL_VISIBLE is set it's even simpler.
> 
> Am I missing some major hazards?

I don't think anything fundamental, but it's a decidedly nontrivial amount of
work.

Greetings,

Andres Freund



Re: AIO writes vs hint bits vs checksums

From: Andres Freund

Hi,

On 2024-10-30 09:58:30 -0400, Andres Freund wrote:
> On 2024-10-30 14:16:51 +0200, Heikki Linnakangas wrote:
> > Could we put the overhead on the FlushBuffer()
> > instead?
> >
> > How about something like this:
> >
> > To set hint bits:
> >
> > 1. Lock page in SHARED mode.
> > 2. Read BM_IO_IN_PROGRESS
> > 3. If !BM_IO_IN_PROGRESS, set hint bits, otherwise don't
> > 4. Unlock page
> >
> > To flush a buffer:
> >
> > 1. Lock page in SHARED mode
> > 2. Set BM_IO_IN_PROGRESS
> > 3. Read the lock count on the buffer lock, to see if we're the only locker.
> > 4. If anyone else is holding the lock, upgrade it to exclusive mode, and
> > immediately downgrade back to share mode.
> > 5. calculate CRC, flush the buffer
> > 6. Clear BM_IO_IN_PROGRESS and unlock page.
> >
> > This goes back to the idea of adding LWLock support for this, but the amount
> > of changes could be pretty small. The missing operation we don't have today
> > is reading the share-lock count on the lock in step 3. That seems simple to
> > add.
>
> I've played around with a bunch of ideas like this. There are two main
> reasons I didn't like them that much in the end:
>
> 1) The worst case latency impacts seemed to make them not that
>    interesting. A buffer that is heavily contended with share locks might not
>    get down to zero share lockers for quite a while. That's not a problem for
>    individual FlushBuffer() calls, but it could very well add up to a decent
>    sized delay for something like a checkpoint that has to flush a lot of
>    buffers.
>
>    Waiting for all pre-existing share lockers is easier said than done. We
>    don't record the acquisition order anywhere and a share-lock release won't
>    wake anybody if the lockcount doesn't reach zero. Changing that wouldn't
>    exactly be free and the cost would be borne by all lwlock users.
>
> 2) They are based on holding an lwlock. But it's actually quite conceivable
>    that we'd want to set something hint-bit-like without any lock, as we
>    e.g. currently do for freespace/. That would work with something like the
>    approach I chose here, but not if we rely on lwlocks.
>
>    E.g. with a bit of work we could actually do sequential scans without
>    blocking concurrent buffer modifications by carefully ordering when
>    pd_lower is modified and making sure that tuple header fields are written
>    in a sensible order.

I still don't like this idea a whole lot - but perhaps we could reduce the
overhead of my proposal some, to get closer to yours. When setting hint bits
for many tuples on a page the overhead of my approach is negligible, but when
doing it for individual tuples it's a bit less negligible.

We can reduce the efficiency difference substantially by adding a bufmgr.c API
that sets hint bits on a page. That function can set the hint bit while holding the
buffer header lock, and therefore doesn't need to set BM_SETTING_HINTS and
thus also doesn't need to do resowner.c accounting.
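
Something roughly like this, i.e. a hypothetical bufmgr.c function (marking
the buffer dirty etc. elided):

void
BufferSetHintBits16(Buffer buffer, uint16 *ptr, uint16 val)
{
    BufferDesc *bufdesc = GetBufferDescriptor(buffer - 1);
    uint32      buf_state;

    /*
     * StartBufferIO() sets BM_IO_IN_PROGRESS while holding the buffer
     * header lock, so holding it here suffices to keep IO from starting.
     */
    buf_state = LockBufHdr(bufdesc);
    if (!(buf_state & BM_IO_IN_PROGRESS))
        *ptr = val;
    UnlockBufHdr(bufdesc, buf_state);
}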

To see the worst case overhead, I
a) disabled the "batch" optimization
b) disabled checksums, as that otherwise would hide small efficiency
   differences
c) used an unlogged table

and measured the performance difference for a previously-unhinted sequential
scan of a narrow table that immediately discards all tuples due to OFFSET -
afaict the worst case for the proposed new behaviour.

Previously this was 30.8% slower than master. Now it's only 1.9% slower.

With the batch optimization enabled, the patchset is 7.5% faster.


I also looked at the performance impact on scans that cannot use the batched
approach. The worst case I could think of was a large ordered indexscan of a
previously unhinted table.

For an IOS, the performance difference is a slowdown of 0.65%.

But the difference being so small is partially just due to IOS being
considerably slower than a plain index scan when all tuples need to be fetched
from the table (we need to address that...). Forcing a non-IOS index scan using
enable_indexonlyscan, I get a slowdown of 5.0%.


Given that this is an intentionally pathological workload - there's no IO /
the workload is CPU bound, yet the data is only ever accessed once, with a
query that doesn't ever actually look at the data - I'm quite happy with that.

As soon as I make it a query that actually uses the data, the difference
vanishes in the noise.

Greetings,

Andres Freund