Thread: Use read streams in pg_visibility

Use read streams in pg_visibility

From
Nazir Bilal Yavuz
Date:
Hi,

I am working on using the read stream in pg_visibility. There are two
places to use it:

1- collect_visibility_data()

This one is straightforward. I created a read stream object if
'include_pd' is true because read I/O is done when 'include_pd' is
true. There is ~4% timing improvement with this change. I started the
server with the default settings and created a 6 GB table. Then run
100 times pg_visibility() by clearing the OS cache between each run.
----------

2- collect_corrupt_items()

This one is more complicated. The read stream callback function loops
until it finds a suitable block to read. So, if the callback returns
an InvalidBlockNumber; it means that the stream processed all possible
blocks and the stream should be finished. There is ~3% timing
improvement with this change. I started the server with the default
settings and created a 6 GB table. Then run 100 times
pg_check_visible() by clearing the OS cache between each run.

The downside of this approach is there are too many "vmbuffer is valid
but BufferGetBlockNumber(*vmbuffer) is not equal to mapBlock, so
vmbuffer needs to be read again" cases in the read stream version (700
vs 20 for the 6 GB table). This is caused by the callback function of
the read stream reading a new vmbuf while getting next block numbers.
So, vmbuf is wrong when we are checking visibility map bits that might
have changed while we were acquiring the page lock.
----------

Both patches are attached.

Any feedback would be appreciated.

-- 
Regards,
Nazir Bilal Yavuz
Microsoft

Attachment

Re: Use read streams in pg_visibility

From
Noah Misch
Date:
On Tue, Aug 13, 2024 at 03:22:27PM +0300, Nazir Bilal Yavuz wrote:
> 2- collect_corrupt_items()
> 
> This one is more complicated. The read stream callback function loops
> until it finds a suitable block to read. So, if the callback returns
> an InvalidBlockNumber; it means that the stream processed all possible
> blocks and the stream should be finished. There is ~3% timing
> improvement with this change. I started the server with the default
> settings and created a 6 GB table. Then run 100 times
> pg_check_visible() by clearing the OS cache between each run.
> 
> The downside of this approach is there are too many "vmbuffer is valid
> but BufferGetBlockNumber(*vmbuffer) is not equal to mapBlock, so
> vmbuffer needs to be read again" cases in the read stream version (700
> vs 20 for the 6 GB table). This is caused by the callback function of
> the read stream reading a new vmbuf while getting next block numbers.
> So, vmbuf is wrong when we are checking visibility map bits that might
> have changed while we were acquiring the page lock.

Did the test that found 700 "read again" use a different procedure than the
one shared as setup.sh down-thread?  Using that script alone, none of the vm
bits are set, so I'd not expect any heap page reads.

The "vmbuffer needs to be read again" regression fits what I would expect the
v1 patch to do with a table having many vm bits set.  In general, I think the
fix is to keep two heap Buffer vars, so the callback can work on one vmbuffer
while collect_corrupt_items() works on another vmbuffer.  Much of the time,
they'll be the same buffer.  It could be as simple as that, but you could
consider further optimizations like these:

- Use ReadRecentBuffer() or a similar technique, to avoid a buffer mapping
  lookup when the other Buffer var already has the block you want.

- [probably not worth it] Add APIs for pg_visibility.c to tell read_stream.c
  to stop calling the ReadStreamBlockNumberCB for awhile.  This could help if
  nonzero vm bits are infrequent, causing us to visit few heap blocks per vm
  block.  For example, if one block out of every 33000 is all-visible, every
  heap block we visit has a different vmbuffer.  It's likely not optimal for
  the callback to pin and unpin 20 vmbuffers, then have
  collect_corrupt_items() pin and unpin the same 20 vmbuffers.  pg_visibility
  could notice this trend and request a stop of the callbacks until more of
  the heap block work completes.  If pg_visibility is going to be the only
  place in the code with this use case, it's probably not worth carrying the
  extra API just for pg_visibility.  However, if we get a stronger use case
  later, pg_visibility could be another beneficiary.

> +/*
> + * Callback function to get next block for read stream object used in
> + * collect_visibility_data() function.
> + */
> +static BlockNumber
> +collect_visibility_data_read_stream_next_block(ReadStream *stream,
> +                                               void *callback_private_data,
> +                                               void *per_buffer_data)
> +{
> +    struct collect_visibility_data_read_stream_private *p = callback_private_data;
> +
> +    if (p->blocknum < p->nblocks)
> +        return p->blocknum++;
> +
> +    return InvalidBlockNumber;

This is the third callback that just iterates over a block range, after
pg_prewarm_read_stream_next_block() and
copy_storage_using_buffer_read_stream_next_block().  While not a big problem,
I think it's time to have a general-use callback for block range scans.  The
quantity of duplicate code is low, but the existing function names are long
and less informative than a behavior-based name.

> +static BlockNumber
> +collect_corrupt_items_read_stream_next_block(ReadStream *stream,
> +                                             void *callback_private_data,
> +                                             void *per_buffer_data)
> +{
> +    struct collect_corrupt_items_read_stream_private *p = callback_private_data;
> +
> +    for (; p->blocknum < p->nblocks; p->blocknum++)
> +    {
> +        bool        check_frozen = false;
> +        bool        check_visible = false;
> +
> +        if (p->all_frozen && VM_ALL_FROZEN(p->rel, p->blocknum, p->vmbuffer))
> +            check_frozen = true;
> +        if (p->all_visible && VM_ALL_VISIBLE(p->rel, p->blocknum, p->vmbuffer))
> +            check_visible = true;
> +        if (!check_visible && !check_frozen)
> +            continue;

If a vm has zero bits set, this loop will scan the entire vm before returning.
Hence, this loop deserves a CHECK_FOR_INTERRUPTS() or a comment about how
VM_ALL_FROZEN/VM_ALL_VISIBLE reaches a CHECK_FOR_INTERRUPTS().

> @@ -687,6 +734,20 @@ collect_corrupt_items(Oid relid, bool all_visible, bool all_frozen)
>      items->count = 64;
>      items->tids = palloc(items->count * sizeof(ItemPointerData));
>  
> +    p.blocknum = 0;
> +    p.nblocks = nblocks;
> +    p.rel = rel;
> +    p.vmbuffer = &vmbuffer;
> +    p.all_frozen = all_frozen;
> +    p.all_visible = all_visible;
> +    stream = read_stream_begin_relation(READ_STREAM_FULL,
> +                                        bstrategy,
> +                                        rel,
> +                                        MAIN_FORKNUM,
> +                                        collect_corrupt_items_read_stream_next_block,
> +                                        &p,
> +                                        0);
> +
>      /* Loop over every block in the relation. */
>      for (blkno = 0; blkno < nblocks; ++blkno)

The callback doesn't return blocks having zero vm bits, so the blkno variable
is not accurate.  I didn't test, but I think the loop's "Recheck to avoid
returning spurious results." looks at the bit for the wrong block.  If that's
what v1 does, could you expand the test file to catch that?  For example, make
a two-block table with only the second block all-visible.

Since the callback skips blocks, this function should use the
read_stream_reset() style of looping:

    while ((buffer = read_stream_next_buffer(stream, NULL)) != InvalidBuffer) ...

Thanks,
nm



Re: Use read streams in pg_visibility

From
Noah Misch
Date:
On Fri, Aug 23, 2024 at 02:20:06PM +0300, Nazir Bilal Yavuz wrote:
> On Tue, 20 Aug 2024 at 21:47, Noah Misch <noah@leadboat.com> wrote:
> > On Tue, Aug 13, 2024 at 03:22:27PM +0300, Nazir Bilal Yavuz wrote:
> > > The downside of this approach is there are too many "vmbuffer is valid
> > > but BufferGetBlockNumber(*vmbuffer) is not equal to mapBlock, so
> > > vmbuffer needs to be read again" cases in the read stream version (700
> > > vs 20 for the 6 GB table). This is caused by the callback function of
> > > the read stream reading a new vmbuf while getting next block numbers.
> > > So, vmbuf is wrong when we are checking visibility map bits that might
> > > have changed while we were acquiring the page lock.
> >
> > Did the test that found 700 "read again" use a different procedure than the
> > one shared as setup.sh down-thread?  Using that script alone, none of the vm
> > bits are set, so I'd not expect any heap page reads.
> 
> Yes, it is basically the same script but the query is 'SELECT
> pg_check_visible('${TABLE}'::regclass);'.

I gather the scripts deal exclusively in tables with no vm bits set, so
pg_visibility did no heap page reads under those scripts.  But the '700 "read
again"' result suggests either I'm guessing wrong, or that result came from a
different test procedure.  Thoughts?

> > The "vmbuffer needs to be read again" regression fits what I would expect the
> > v1 patch to do with a table having many vm bits set.  In general, I think the
> > fix is to keep two heap Buffer vars, so the callback can work on one vmbuffer
> > while collect_corrupt_items() works on another vmbuffer.  Much of the time,
> > they'll be the same buffer.  It could be as simple as that, but you could
> > consider further optimizations like these:
> 
> Thanks for the suggestion. Keeping two vmbuffers lowered the count of
> "read-again" cases to ~40. I ran the test again and the timing
> improvement is ~5% now.

> I think the number of "read-again" cases are low enough now.

Agreed.  The increase from 20 to 40 is probably an increase in buffer mapping
lookups, not an increase in I/O.

> Does creating something like
> pg_general_read_stream_next_block() in read_stream code and exporting
> it makes sense?

To me, that name isn't conveying the function's behavior, and the words "pg_"
and "general_" aren't adding information there.  How about one of these names
or similar:

seq_read_stream_cb
sequential_read_stream_cb
block_range_read_stream_cb

> > The callback doesn't return blocks having zero vm bits, so the blkno variable
> > is not accurate.  I didn't test, but I think the loop's "Recheck to avoid
> > returning spurious results." looks at the bit for the wrong block.  If that's
> > what v1 does, could you expand the test file to catch that?  For example, make
> > a two-block table with only the second block all-visible.
> 
> Yes, it was not accurate. I am getting blockno from the buffer now. I
> checked and confirmed it is working as expected by manually logging
> blocknos returned from the read stream. I am not sure how to add a
> test case for this.

VACUUM FREEZE makes an all-visible, all-frozen table.  DELETE of a particular
TID, even if rolled back, clears both vm bits for the TID's page.  Past tests
like that had instability problems.  One cause is a concurrent session's XID
or snapshot, which can prevent VACUUM setting vm bits.  Using a TEMP table may
have been one of the countermeasures, but I don't remember clearly.  Hence,
please search the archives or the existing pg_visibility tests for how we
dealt with that.  It may not be problem for this particular test.

> There is one behavioral change introduced with the patches. It could
> happen when collect_corrupt_items() is called with both all_visible
> and all_frozen being true.
> -> Let's say VM_ALL_FROZEN() returns true but VM_ALL_VISIBLE() returns
> false in callback. Callback returns this block number because
> VM_ALL_FROZEN() is true.
> -> At the /* Recheck to avoid returning spurious results. */ part, we
> should only check VM_ALL_FROZEN() again as this was returning true in
> the callback. But we check both VM_ALL_FROZEN() and VM_ALL_VISIBLE().
> VM_ALL_FROZEN() returns false and VM_ALL_VISIBLE() returns true now
> (vice versa of callback).
> -> We were skipping this block in the master but the patched version
> does not skip that.
> 
> Is this a problem? If this is a problem, a solution might be to
> callback return values of VM_ALL_FROZEN() and VM_ALL_VISIBLE() in the
> per_buffer_data.

No, I don't consider that a problem.  That's not making us do additional I/O,
just testing more bits within the pages we're already loading.  The difference
in behavior is user-visible, but it's the same behavior change the user would
get if the timing had been a bit different.



Re: Use read streams in pg_visibility

From
Noah Misch
Date:
On Tue, Aug 27, 2024 at 10:49:19AM +0300, Nazir Bilal Yavuz wrote:
> On Fri, 23 Aug 2024 at 22:01, Noah Misch <noah@leadboat.com> wrote:
> > On Fri, Aug 23, 2024 at 02:20:06PM +0300, Nazir Bilal Yavuz wrote:
> > > On Tue, 20 Aug 2024 at 21:47, Noah Misch <noah@leadboat.com> wrote:
> > > > On Tue, Aug 13, 2024 at 03:22:27PM +0300, Nazir Bilal Yavuz wrote:

> I liked the block_range_read_stream_cb. Attached patches for that
> (first 3 patches). I chose an nblocks way instead of last_blocks in
> the struct.

To read blocks 10 and 11, I would expect to initialize the struct with one of:

{ .first=10, .nblocks=2 }
{ .first=10, .last_inclusive=11 }
{ .first=10, .last_exclusive=12 }

With the patch's API, I would need {.first=10,.nblocks=12}.  The struct field
named "nblocks" behaves like a last_block_exclusive.  Please either make the
behavior an "nblocks" behavior or change the field name to replace the term
"nblocks" with something matching the behavior.  (I used longer field names in
my examples here, to disambiguate those examples.  It's okay if the final
field names aren't those, as long as the field names and the behavior align.)

> > > > The callback doesn't return blocks having zero vm bits, so the blkno variable
> > > > is not accurate.  I didn't test, but I think the loop's "Recheck to avoid
> > > > returning spurious results." looks at the bit for the wrong block.  If that's
> > > > what v1 does, could you expand the test file to catch that?  For example, make
> > > > a two-block table with only the second block all-visible.
> > >
> > > Yes, it was not accurate. I am getting blockno from the buffer now. I
> > > checked and confirmed it is working as expected by manually logging
> > > blocknos returned from the read stream. I am not sure how to add a
> > > test case for this.
> >
> > VACUUM FREEZE makes an all-visible, all-frozen table.  DELETE of a particular
> > TID, even if rolled back, clears both vm bits for the TID's page.  Past tests
> > like that had instability problems.  One cause is a concurrent session's XID
> > or snapshot, which can prevent VACUUM setting vm bits.  Using a TEMP table may
> > have been one of the countermeasures, but I don't remember clearly.  Hence,
> > please search the archives or the existing pg_visibility tests for how we
> > dealt with that.  It may not be problem for this particular test.
> 
> Thanks for the information, I will check these. What I still do not
> understand is how to make sure that only the second block is processed
> and the first one is skipped. pg_check_visible() and pg_check_frozen()
> returns TIDs that cause corruption in the visibility map, there is no
> information about block numbers.

I see what you're saying.  collect_corrupt_items() needs a corrupt table to
report anything; all corruption-free tables get the same output.  Testing this
would need extra C code or techniques like corrupt_page_checksum() to create
the corrupt state.  That wouldn't be a bad thing to have, but it's big enough
that I'll consider it out of scope for $SUBJECT.  With the callback change
above, I'll be ready to push all this.



Re: Use read streams in pg_visibility

From
Noah Misch
Date:
On Mon, Sep 02, 2024 at 03:20:12PM +0300, Nazir Bilal Yavuz wrote:
> Thanks, updated patches are attached.

> +/*
> + * Ask the callback which block it would like us to read next, with a small
> + * buffer in front to allow read_stream_unget_block() to work and to allow the
> + * fast path to skip this function and work directly from the array.
>   */
>  static inline BlockNumber
>  read_stream_get_block(ReadStream *stream, void *per_buffer_data)

v4-0001-Add-general-use-struct-and-callback-to-read-strea.patch introduced
this update to the read_stream_get_block() header comment, but we're not
changing read_stream_get_block() here.  I reverted this.

Pushed with some other cosmetic changes.



Re: Use read streams in pg_visibility

From
Noah Misch
Date:
On Tue, Sep 03, 2024 at 10:50:11AM -0700, Noah Misch wrote:
> On Mon, Sep 02, 2024 at 03:20:12PM +0300, Nazir Bilal Yavuz wrote:
> > Thanks, updated patches are attached.
> 
> > +/*
> > + * Ask the callback which block it would like us to read next, with a small
> > + * buffer in front to allow read_stream_unget_block() to work and to allow the
> > + * fast path to skip this function and work directly from the array.
> >   */
> >  static inline BlockNumber
> >  read_stream_get_block(ReadStream *stream, void *per_buffer_data)
> 
> v4-0001-Add-general-use-struct-and-callback-to-read-strea.patch introduced
> this update to the read_stream_get_block() header comment, but we're not
> changing read_stream_get_block() here.  I reverted this.
> 
> Pushed with some other cosmetic changes.

I see I pushed something unacceptable under ASAN.  I will look into that:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=grassquit&dt=2024-09-03%2017%3A47%3A20



Re: Use read streams in pg_visibility

From
Noah Misch
Date:
On Tue, Sep 03, 2024 at 10:46:34PM +0300, Nazir Bilal Yavuz wrote:
> On Tue, 3 Sept 2024 at 22:20, Noah Misch <noah@leadboat.com> wrote:
> > On Tue, Sep 03, 2024 at 10:50:11AM -0700, Noah Misch wrote:
> > > Pushed with some other cosmetic changes.
> 
> Thanks!
> 
> > I see I pushed something unacceptable under ASAN.  I will look into that:
> > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=grassquit&dt=2024-09-03%2017%3A47%3A20
> 
> I think it is related to the scope of BlockRangeReadStreamPrivate in
> the collect_visibility_data() function. Attached a small fix for that.

Right.

https://postgr.es/m/CAEudQAozv3wTY5TV2t29JcwPydbmKbiWQkZD42S2OgzdixPMDQ@mail.gmail.com
then observed that collect_corrupt_items() was now guaranteed to never detect
corruption.  I have pushed revert ddfc556 of the pg_visibility.c changes.  For
the next try, could you add that testing we discussed?



Re: Use read streams in pg_visibility

From
Nazir Bilal Yavuz
Date:
Hi,

On Wed, 4 Sept 2024 at 21:43, Noah Misch <noah@leadboat.com> wrote:
>
> https://postgr.es/m/CAEudQAozv3wTY5TV2t29JcwPydbmKbiWQkZD42S2OgzdixPMDQ@mail.gmail.com
> then observed that collect_corrupt_items() was now guaranteed to never detect
> corruption.  I have pushed revert ddfc556 of the pg_visibility.c changes.  For
> the next try, could you add that testing we discussed?

Do you think that corrupting the visibility map and then seeing if
pg_check_visible() and pg_check_frozen() report something is enough?

-- 
Regards,
Nazir Bilal Yavuz
Microsoft



Re: Use read streams in pg_visibility

From
Noah Misch
Date:
On Thu, Sep 05, 2024 at 03:59:53PM +0300, Nazir Bilal Yavuz wrote:
> On Wed, 4 Sept 2024 at 21:43, Noah Misch <noah@leadboat.com> wrote:
> > https://postgr.es/m/CAEudQAozv3wTY5TV2t29JcwPydbmKbiWQkZD42S2OgzdixPMDQ@mail.gmail.com
> > then observed that collect_corrupt_items() was now guaranteed to never detect
> > corruption.  I have pushed revert ddfc556 of the pg_visibility.c changes.  For
> > the next try, could you add that testing we discussed?
> 
> Do you think that corrupting the visibility map and then seeing if
> pg_check_visible() and pg_check_frozen() report something is enough?

I think so.  Please check that it would have caught both the blkno bug and the
above bug.



Re: Use read streams in pg_visibility

From
Noah Misch
Date:
On Tue, Sep 10, 2024 at 02:35:46PM +0300, Nazir Bilal Yavuz wrote:
> Your patch is correct. I wrongly assumed it would catch blockno bug,
> the attached version catches it. I made blockno = 0 invisible and not
> frozen before copying the vm file. So, in the blockno buggy version;
> callback will skip that block but the main loop in the
> collect_corrupt_items() will not skip it. I tested it with your patch
> and there is exactly 1 blockno difference between expected and result
> output.

Pushed.  I added autovacuum=off so auto-analyze of a system catalog can't take
a snapshot that blocks VACUUM updating the vismap.  I doubt that could happen
under default settings, but this lets us disregard the possibility entirely.

I also fixed the mix of tabs and spaces inside test file string literals.



Re: Use read streams in pg_visibility

From
Nazir Bilal Yavuz
Date:
Hi,

On Wed, 11 Sept 2024 at 01:38, Noah Misch <noah@leadboat.com> wrote:
>
> On Tue, Sep 10, 2024 at 02:35:46PM +0300, Nazir Bilal Yavuz wrote:
> > Your patch is correct. I wrongly assumed it would catch blockno bug,
> > the attached version catches it. I made blockno = 0 invisible and not
> > frozen before copying the vm file. So, in the blockno buggy version;
> > callback will skip that block but the main loop in the
> > collect_corrupt_items() will not skip it. I tested it with your patch
> > and there is exactly 1 blockno difference between expected and result
> > output.
>
> Pushed.  I added autovacuum=off so auto-analyze of a system catalog can't take
> a snapshot that blocks VACUUM updating the vismap.  I doubt that could happen
> under default settings, but this lets us disregard the possibility entirely.

Thanks!

> I also fixed the mix of tabs and spaces inside test file string literals.

I ran both pgindent and pgperltidy but they didn't catch them. Is
there an automated way to catch them?

-- 
Regards,
Nazir Bilal Yavuz
Microsoft



Re: Use read streams in pg_visibility

From
Noah Misch
Date:
On Wed, Sep 11, 2024 at 09:19:09AM +0300, Nazir Bilal Yavuz wrote:
> On Wed, 11 Sept 2024 at 01:38, Noah Misch <noah@leadboat.com> wrote:
> > I also fixed the mix of tabs and spaces inside test file string literals.
> 
> I ran both pgindent and pgperltidy but they didn't catch them. Is
> there an automated way to catch them?

git diff --check