Thread: Streaming read-ready sequential scan code

From: Melanie Plageman
Hi,

Last year, David and I worked on a round of refactoring for
heapgettup() and heapgettup_pagemode() [1]. Now that the streaming
read API has been proposed [2], there is a bit more refactoring that
can be done on master to prepare sequential scan to support streaming
reads.

Patches 0001 and 0002 in the attached patchset do this new round of
refactoring. 0003 is the remainder of the streaming read API that is
not yet in master. 0004 is the sequential scan streaming read user.

The primary change needed to be able to drop in streaming read support
was that heapgettup() and heapgettup_pagemode() have to wait until
there are no more valid buffers, instead of waiting until there are no
more valid BlockNumbers, to know that the relation has been entirely
processed. Naturally, streaming reads prefetch ahead of the block
currently being processed by the scan, so all blocks should have been
requested long before all blocks have been processed.
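
A simplified sketch of the shape of that change (function names are
taken from a later version of the patchset; this is not the exact
patch code):

    while (true)
    {
        /* Request the next block; with streaming reads it may have been
         * read ahead long before this point. */
        heapfetchbuf(scan, dir);

        /* The relation is entirely processed when no valid buffer
         * remains, not when we run out of BlockNumbers to request. */
        if (!BufferIsValid(scan->rs_cbuf))
            break;

        /* ... prune, check tuple visibility, and return tuples ... */
    }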

To change this, I split up heapgetpage() into two functions -- one
responsible for getting blocks into buffers and the other for
processing a page (pruning, checking tuple visibility, etc). As a
consequence, I had to change the other caller of heapgetpage() (sample
scans). Since I was doing this anyway, I made a few changes there. It
is arguable that those changes could be split up differently between
0001 and 0004. However, I wanted 0004 to be *only* the sequential scan
streaming read user code.
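
Schematically, the split looks like this (a sketch; heapfetchbuf() and
heapbuildvis() are the names the patchset uses for the two halves):

    /* First half: get the next block into a pinned buffer. */
    heapfetchbuf(scan, dir);

    /* Second half: process the page -- prune it and, in pagemode,
     * record the offsets of visible tuples in rs_vistuples. */
    if (scan->rs_base.rs_flags & SO_ALLOW_PAGEMODE)
        heapbuildvis(scan);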

There is an outstanding question about where to allocate the
PgStreamingRead object for sequential scans (see TODO in 0004).
However, I thought I would keep this thread focused on 0001 and 0002.

Though logically the performance with 0001 and 0002 should be the same
as master (no new non-inline function calls, no additional looping),
I've done a bit of profiling anyway. I created a large multi-GB table,
read it all into shared buffers (disabling the large sequential scan
bulkread optimization), and did a sequential SELECT count(*) from the
table. From the profiles below, you'll notice that master and the
patch are basically the same. Actual percentages vary from run-to-run.
Execution time is the same.

patch
  15.49%  postgres  postgres           [.] ExecInterpExpr
  11.03%  postgres  postgres           [.] heapgettup_pagemode
  10.85%  postgres  postgres           [.] ExecStoreBufferHeapTuple
   9.14%  postgres  postgres           [.] heap_getnextslot
   8.39%  postgres  postgres           [.] heapbuildvis
   6.47%  postgres  postgres           [.] SeqNext

master
  14.16%  postgres  postgres           [.] ExecInterpExpr
  11.54%  postgres  postgres           [.] heapgettup_pagemode
  10.63%  postgres  postgres           [.] ExecStoreBufferHeapTuple
  10.22%  postgres  postgres           [.] heap_getnextslot
   8.53%  postgres  postgres           [.] heapgetpage
   5.35%  postgres  postgres           [.] SeqNext

- Melanie

[1] https://www.postgresql.org/message-id/flat/CAAKRu_YSOnhKsDyFcqJsKtBSrd32DP-jjXmv7hL0BPD-z0TGXQ%40mail.gmail.com
[2] https://www.postgresql.org/message-id/flat/CA%2BhUKGJkOiOCa%2Bmag4BF%2BzHo7qo%3Do9CFheB8%3Dg6uT5TUm2gkvA%40mail.gmail.com

Attachment

Re: Streaming read-ready sequential scan code

From: David Rowley
On Tue, 30 Jan 2024 at 10:17, Melanie Plageman
<melanieplageman@gmail.com> wrote:
> Though logically the performance with 0001 and 0002 should be the same
> as master (no new non-inline function calls, no additional looping),
> I've done a bit of profiling anyway. I created a large multi-GB table,
> read it all into shared buffers (disabling the large sequential scan
> bulkread optimization), and did a sequential SELECT count(*) from the
> table. From the profiles below, you'll notice that master and the
> patch are basically the same. Actual percentages vary from run-to-run.
> Execution time is the same.

Can you also run a test on a Seqscan with a filter that filters out
all tuples?  There's less overhead in other parts of the executor with
such a query.

David



Re: Streaming read-ready sequential scan code

From: Melanie Plageman
On Mon, Jan 29, 2024 at 4:24 PM David Rowley <dgrowleyml@gmail.com> wrote:
>
> On Tue, 30 Jan 2024 at 10:17, Melanie Plageman
> <melanieplageman@gmail.com> wrote:
> > Though logically the performance with 0001 and 0002 should be the same
> > as master (no new non-inline function calls, no additional looping),
> > I've done a bit of profiling anyway. I created a large multi-GB table,
> > read it all into shared buffers (disabling the large sequential scan
> > bulkread optimization), and did a sequential SELECT count(*) from the
> > table. From the profiles below, you'll notice that master and the
> > patch are basically the same. Actual percentages vary from run-to-run.
> > Execution time is the same.
>
> Can you also run a test on a Seqscan with a filter that filters out
> all tuples?  There's less overhead in other parts of the executor with
> such a query.

Yes, of course. Thank you so much for taking a look!

While I was at it, I changed the table schema to be entirely composed
of INT type columns and regenerated the data. Note that, both in this
example and my previous example, I ensured that the table was vacuumed
beforehand (and autovacuum disabled for the table) so there wasn't any
on-access pruning happening (heapgetpage() does that in pagemode).

This is the schema:
  CREATE TABLE foo(id INT, a INT, b INT, c INT, d INT, e INT, f INT,
    g INT) with (autovacuum_enabled=false);

I added 46000000 rows to the table, making it 2.6 GB. Shared buffers
is double that. Before profiling, I did a SELECT * from the table with
the large sequential scan bulkread optimization disabled. Then I
vacuumed the table. Finally, I turned up parallel_setup_cost high
enough to disable query parallelism.

The query I profiled was:
SELECT * FROM foo WHERE id = 0;
With the data I generated, 0 rows match that condition.

Profiles below. Execution time essentially the same.

patch:
  17.08%  postgres  postgres           [.] ExecInterpExpr
  11.17%  postgres  postgres           [.] tts_buffer_heap_getsomeattrs
  10.64%  postgres  postgres           [.] ExecStoreBufferHeapTuple
   9.82%  postgres  postgres           [.] heap_getnextslot
   9.13%  postgres  postgres           [.] heapgettup_pagemode
   8.98%  postgres  postgres           [.] heapbuildvis
   5.40%  postgres  postgres           [.] HeapCheckForSerializableConflictOut
   5.16%  postgres  postgres           [.] SeqNext

master:
  17.89%  postgres  postgres           [.] ExecInterpExpr
  12.28%  postgres  postgres           [.] tts_buffer_heap_getsomeattrs
  10.54%  postgres  postgres           [.] ExecStoreBufferHeapTuple
  10.11%  postgres  postgres           [.] heapgettup_pagemode
   8.52%  postgres  postgres           [.] heapgetpage
   8.28%  postgres  postgres           [.] heap_getnextslot
   5.00%  postgres  postgres           [.] HeapCheckForSerializableConflictOut
   4.71%  postgres  postgres           [.] SeqNext

- Melanie



Re: Streaming read-ready sequential scan code

From: Melanie Plageman
On Mon, Jan 29, 2024 at 4:17 PM Melanie Plageman
<melanieplageman@gmail.com> wrote:
>
> There is an outstanding question about where to allocate the
> PgStreamingRead object for sequential scans

I've written three alternative implementations of the actual streaming
read user for sequential scan which handle the question of where to
allocate the streaming read object and how to handle changing scan
direction in different ways.

Option A) https://github.com/melanieplageman/postgres/tree/seqscan_pgsr_initscan_allocation
- Allocates the streaming read object in initscan(). Since we do not
know the scan direction at this time, if the scan ends up not being a
forwards scan, the streaming read object must later be freed -- so
this will sometimes allocate a streaming read object it never uses.
- Only supports ForwardScanDirection and once the scan direction
changes, streaming is never supported again -- even if we return to
ForwardScanDirection
- Must maintain a "fallback" codepath that does not use the streaming read API

Option B) https://github.com/melanieplageman/postgres/tree/seqscan_pgsr_heapgettup_alloc_forward_only
- Allocates the streaming read object in heapgettup[_pagemode]() when
it has not been previously allocated. To do this it has to record and
switch into a different memory context than the per-tuple context. It
only allocates the streaming read object if it is a forwards scan. It
frees the streaming read object if the scan direction is later
changed.
- Only supports ForwardScanDirection and once the scan direction
changes, streaming is never supported again -- even if we return to
ForwardScanDirection
- Must maintain a "fallback" codepath that does not use the streaming read API

Option C) https://github.com/melanieplageman/postgres/tree/seqscan_pgsr_all_dir_stream
- Allocates the streaming read object in heapgettup[_pagemode]() when
it has not been previously allocated. To do this it has to record and
switch into a different memory context than the per-tuple context.
- All scan directions support streaming. To do this, the scan
direction has to be tracked and we must check if the direction has
changed on every heapgettup[_pagemode]() invocation to avoid returning
wrong results.
- No "fallback" codepath as all heap sequential scans will use the
streaming read API

As you can see, each option has pros and cons. I'm interested in what
others think about which we should choose.

- Melanie



Re: Streaming read-ready sequential scan code

From: Robert Haas
On Tue, Feb 20, 2024 at 4:35 AM Melanie Plageman
<melanieplageman@gmail.com> wrote:
> I've written three alternative implementations of the actual streaming
> read user for sequential scan which handle the question of where to
> allocate the streaming read object and how to handle changing scan
> direction in different ways.

It's weird to me that the prospect of changing the scan direction
causes such complexity. I mean, why doesn't a streaming read object
have a forget_all_my_previous_requests() method or somesuch?

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: Streaming read-ready sequential scan code

From: Melanie Plageman
On Tue, Feb 20, 2024 at 6:13 AM Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Tue, Feb 20, 2024 at 4:35 AM Melanie Plageman
> <melanieplageman@gmail.com> wrote:
> > I've written three alternative implementations of the actual streaming
> > read user for sequential scan which handle the question of where to
> > allocate the streaming read object and how to handle changing scan
> > direction in different ways.
>
> It's weird to me that the prospect of changing the scan direction
> causes such complexity. I mean, why doesn't a streaming read object
> have a forget_all_my_previous_requests() method or somesuch?

Basically, that is what pg_streaming_read_free() does. It goes through
and releases the buffers it had pinned and frees any per-buffer data
allocated.

The complexity with the sequential scan streaming read user and scan
direction is just that it has to detect when the scan direction
changes and do the releasing/freeing and reallocation. The scan
direction is passed to heapgettup[_pagemode](), so this is something
that can change on a tuple-to-tuple basis.

It is less that doing this is complicated and more that it is annoying
and distracting to have to check for and handle a very unimportant and
uncommon case in the main path of the common case.
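
A hypothetical sketch of that check at the top of
heapgettup[_pagemode]() (the field names here are invented for
illustration, not taken from the patch):

    if (unlikely(scan->rs_last_dir != dir))
    {
        /* Direction changed since the last call: release pinned buffers
         * and per-buffer data, then reallocate for the new direction. */
        pg_streaming_read_free(scan->rs_pgsr);
        scan->rs_pgsr = NULL;
        scan->rs_last_dir = dir;
    }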

- Melanie



Re: Streaming read-ready sequential scan code

From: Melanie Plageman
On Mon, Feb 19, 2024 at 6:05 PM Melanie Plageman
<melanieplageman@gmail.com> wrote:
>
> On Mon, Jan 29, 2024 at 4:17 PM Melanie Plageman
> <melanieplageman@gmail.com> wrote:
> >
> > There is an outstanding question about where to allocate the
> > PgStreamingRead object for sequential scans
>
> I've written three alternative implementations of the actual streaming
> read user for sequential scan which handle the question of where to
> allocate the streaming read object and how to handle changing scan
> direction in different ways.
>
> Option A) https://github.com/melanieplageman/postgres/tree/seqscan_pgsr_initscan_allocation
> - Allocates the streaming read object in initscan(). Since we do not
> know the scan direction at this time, if the scan ends up not being a
> forwards scan, the streaming read object must later be freed -- so
> this will sometimes allocate a streaming read object it never uses.
> - Only supports ForwardScanDirection and once the scan direction
> changes, streaming is never supported again -- even if we return to
> ForwardScanDirection
> - Must maintain a "fallback" codepath that does not use the streaming read API

Attached is a version of this patch which implements a "reset"
function for the streaming read API which should be cheaper than the
full pg_streaming_read_free() on rescan. This can easily be ported to
work on any of my proposed implementations (A/B/C). I implemented it
on A as an example.
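
The idea, roughly (a hypothetical sketch -- the real function and its
field names are in the attached patch):

    void
    pg_streaming_read_reset(PgStreamingRead *pgsr)
    {
        /* Release any buffers the stream still holds pinned. */
        for (int i = 0; i < pgsr->pinned_buffers; i++)
            ReleaseBuffer(pgsr->buffers[i]);
        pgsr->pinned_buffers = 0;

        /* Unlike pg_streaming_read_free(), keep the PgStreamingRead
         * allocation and per-buffer data arrays for reuse on rescan. */
    }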

- Melanie

Attachment

Re: Streaming read-ready sequential scan code

From: Melanie Plageman
On Mon, Feb 26, 2024 at 03:56:57PM -0500, Melanie Plageman wrote:
> On Mon, Feb 19, 2024 at 6:05 PM Melanie Plageman
> <melanieplageman@gmail.com> wrote:
> >
> > On Mon, Jan 29, 2024 at 4:17 PM Melanie Plageman
> > <melanieplageman@gmail.com> wrote:
> > >
> > > There is an outstanding question about where to allocate the
> > > PgStreamingRead object for sequential scans
> >
> > I've written three alternative implementations of the actual streaming
> > read user for sequential scan which handle the question of where to
> > allocate the streaming read object and how to handle changing scan
> > direction in different ways.
> >
> > Option A) https://github.com/melanieplageman/postgres/tree/seqscan_pgsr_initscan_allocation
> > - Allocates the streaming read object in initscan(). Since we do not
> > know the scan direction at this time, if the scan ends up not being a
> > forwards scan, the streaming read object must later be freed -- so
> > this will sometimes allocate a streaming read object it never uses.
> > - Only supports ForwardScanDirection and once the scan direction
> > changes, streaming is never supported again -- even if we return to
> > ForwardScanDirection
> > - Must maintain a "fallback" codepath that does not use the streaming read API
> 
> Attached is a version of this patch which implements a "reset"
> function for the streaming read API which should be cheaper than the
> full pg_streaming_read_free() on rescan. This can easily be ported to
> work on any of my proposed implementations (A/B/C). I implemented it
> on A as an example.

Attached is the latest version of this patchset -- rebased in light of
Thomas' updates to the streaming read API [1]. I have chosen the
approach I think we should go with. It is a hybrid of my previously
proposed approaches.

The streaming read is allocated in heap_beginscan() and then reset on
rescan and when the scan direction changes. I only check if the scan
direction changes when a new page is needed. This implementation means
no fallback method is needed, so we can remove the non-streaming read
code for heap sequential scans.

Because heapgettup() and heapgettup_pagemode() are also used for TID
range scans, this patch also happens to implement streaming reads for
TID range scans.

- Melanie

[1] https://www.postgresql.org/message-id/CA%2BhUKGJtLyxcAEvLhVUhgD4fMQkOu3PDaj8Qb9SR_UsmzgsBpQ%40mail.gmail.com

Attachment

Re: Streaming read-ready sequential scan code

From: Melanie Plageman
On Wed, Feb 28, 2024 at 12:30 PM Melanie Plageman
<melanieplageman@gmail.com> wrote:
>
> On Mon, Feb 26, 2024 at 03:56:57PM -0500, Melanie Plageman wrote:
> > On Mon, Feb 19, 2024 at 6:05 PM Melanie Plageman
> > <melanieplageman@gmail.com> wrote:
> > >
> > > On Mon, Jan 29, 2024 at 4:17 PM Melanie Plageman
> > > <melanieplageman@gmail.com> wrote:
> > > >
> > > > There is an outstanding question about where to allocate the
> > > > PgStreamingRead object for sequential scans
> > >
> > > I've written three alternative implementations of the actual streaming
> > > read user for sequential scan which handle the question of where to
> > > allocate the streaming read object and how to handle changing scan
> > > direction in different ways.
> > >
> > > Option A) https://github.com/melanieplageman/postgres/tree/seqscan_pgsr_initscan_allocation
> > > - Allocates the streaming read object in initscan(). Since we do not
> > > know the scan direction at this time, if the scan ends up not being a
> > > forwards scan, the streaming read object must later be freed -- so
> > > this will sometimes allocate a streaming read object it never uses.
> > > - Only supports ForwardScanDirection and once the scan direction
> > > changes, streaming is never supported again -- even if we return to
> > > ForwardScanDirection
> > > - Must maintain a "fallback" codepath that does not use the streaming read API
> >
> > Attached is a version of this patch which implements a "reset"
> > function for the streaming read API which should be cheaper than the
> > full pg_streaming_read_free() on rescan. This can easily be ported to
> > work on any of my proposed implementations (A/B/C). I implemented it
> > on A as an example.
>
> Attached is the latest version of this patchset -- rebased in light of
> Thomas' updates to the streaming read API [1]. I have chosen the
> approach I think we should go with. It is a hybrid of my previously
> proposed approaches.

While investigating some performance concerns, Andres pointed out that
the members I add to HeapScanDescData in this patch push rs_cindex and
rs_ntuples to another cacheline and introduce a 4-byte hole. Attached
v4's HeapScanDescData is as well-packed as on master and its members
are reordered so that rs_cindex and rs_ntuples are back on the second
cacheline of the struct's data.
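
The alignment issue itself is plain C struct layout. A standalone
illustration (these are not the actual HeapScanDescData members):

    #include <stdio.h>
    #include <stdint.h>

    struct with_hole
    {
        uint32_t    nblocks;    /* 4 bytes */
                                /* 4-byte hole: the pointer below needs
                                 * 8-byte alignment */
        char       *page;       /* 8 bytes */
        uint32_t    cindex;     /* 4 bytes + 4 bytes tail padding */
    };                          /* sizeof == 24 on LP64 */

    struct packed
    {
        char       *page;       /* 8 bytes */
        uint32_t    nblocks;    /* 4 bytes */
        uint32_t    cindex;     /* 4 bytes */
    };                          /* sizeof == 16: same members, no hole */

    int
    main(void)
    {
        printf("%zu vs %zu\n",
               sizeof(struct with_hole), sizeof(struct packed));
        return 0;
    }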

- Melanie

Attachment

Re: Streaming read-ready sequential scan code

From: Melanie Plageman
On Sat, Mar 02, 2024 at 06:07:48PM -0500, Melanie Plageman wrote:
> On Wed, Feb 28, 2024 at 12:30 PM Melanie Plageman
> <melanieplageman@gmail.com> wrote:
> >
> > On Mon, Feb 26, 2024 at 03:56:57PM -0500, Melanie Plageman wrote:
> > > On Mon, Feb 19, 2024 at 6:05 PM Melanie Plageman
> > > <melanieplageman@gmail.com> wrote:
> > > >
> > > > On Mon, Jan 29, 2024 at 4:17 PM Melanie Plageman
> > > > <melanieplageman@gmail.com> wrote:
> > > > >
> > > > > There is an outstanding question about where to allocate the
> > > > > PgStreamingRead object for sequential scans
> > > >
> > > > I've written three alternative implementations of the actual streaming
> > > > read user for sequential scan which handle the question of where to
> > > > allocate the streaming read object and how to handle changing scan
> > > > direction in different ways.
> > > >
> > > > Option A) https://github.com/melanieplageman/postgres/tree/seqscan_pgsr_initscan_allocation
> > > > - Allocates the streaming read object in initscan(). Since we do not
> > > > know the scan direction at this time, if the scan ends up not being a
> > > > forwards scan, the streaming read object must later be freed -- so
> > > > this will sometimes allocate a streaming read object it never uses.
> > > > - Only supports ForwardScanDirection and once the scan direction
> > > > changes, streaming is never supported again -- even if we return to
> > > > ForwardScanDirection
> > > > - Must maintain a "fallback" codepath that does not use the streaming read API
> > >
> > > Attached is a version of this patch which implements a "reset"
> > > function for the streaming read API which should be cheaper than the
> > > full pg_streaming_read_free() on rescan. This can easily be ported to
> > > work on any of my proposed implementations (A/B/C). I implemented it
> > > on A as an example.
> >
> > Attached is the latest version of this patchset -- rebased in light of
> > Thomas' updates to the streaming read API [1]. I have chosen the
> > approach I think we should go with. It is a hybrid of my previously
> > proposed approaches.
> 
> While investigating some performance concerns, Andres pointed out that
> the members I add to HeapScanDescData in this patch push rs_cindex and
> rs_ntuples to another cacheline and introduce a 4-byte hole. Attached
> v4's HeapScanDescData is as well-packed as on master and its members
> are reordered so that rs_cindex and rs_ntuples are back on the second
> cacheline of the struct's data.

I did some additional profiling and realized that dropping the
unlikely() from the places we check rs_inited frequently was negatively
impacting performance. v5 adds those back and also makes a few other
very minor cleanups.
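
For reference, the pattern in question looks like this (simplified):

    /* The initialization branch is taken only once per scan, so marking
     * it unlikely() keeps the per-tuple path hot. */
    if (unlikely(!scan->rs_inited))
    {
        block = heapgettup_initial_block(scan, dir);
        /* ... first-call setup ... */
    }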

Note that this patch set has a not yet released version of Thomas
Munro's Streaming Read API with a new ramp-up logic which seems to fix a
performance issue I saw with my test case when all of the sequential
scan's blocks are in shared buffers. Once he sends the official new
version, I will rebase this and point to his explanation in that thread.

- Melanie

Attachment

Re: Streaming read-ready sequential scan code

From: Melanie Plageman
On Fri, Mar 8, 2024 at 4:56 PM Melanie Plageman
<melanieplageman@gmail.com> wrote:
>
> On Sat, Mar 02, 2024 at 06:07:48PM -0500, Melanie Plageman wrote:
> > On Wed, Feb 28, 2024 at 12:30 PM Melanie Plageman
> > <melanieplageman@gmail.com> wrote:
> > >
> > > On Mon, Feb 26, 2024 at 03:56:57PM -0500, Melanie Plageman wrote:
> > > > On Mon, Feb 19, 2024 at 6:05 PM Melanie Plageman
> > > > <melanieplageman@gmail.com> wrote:
> > > > >
> > > > > On Mon, Jan 29, 2024 at 4:17 PM Melanie Plageman
> > > > > <melanieplageman@gmail.com> wrote:
> > > > > >
> > > > > > There is an outstanding question about where to allocate the
> > > > > > PgStreamingRead object for sequential scans
> > > > >
> > > > > I've written three alternative implementations of the actual streaming
> > > > > read user for sequential scan which handle the question of where to
> > > > > allocate the streaming read object and how to handle changing scan
> > > > > direction in different ways.
> > > > >
> > > > > Option A) https://github.com/melanieplageman/postgres/tree/seqscan_pgsr_initscan_allocation
> > > > > - Allocates the streaming read object in initscan(). Since we do not
> > > > > know the scan direction at this time, if the scan ends up not being a
> > > > > forwards scan, the streaming read object must later be freed -- so
> > > > > this will sometimes allocate a streaming read object it never uses.
> > > > > - Only supports ForwardScanDirection and once the scan direction
> > > > > changes, streaming is never supported again -- even if we return to
> > > > > ForwardScanDirection
> > > > > - Must maintain a "fallback" codepath that does not use the streaming read API
> > > >
> > > > Attached is a version of this patch which implements a "reset"
> > > > function for the streaming read API which should be cheaper than the
> > > > full pg_streaming_read_free() on rescan. This can easily be ported to
> > > > work on any of my proposed implementations (A/B/C). I implemented it
> > > > on A as an example.
> > >
> > > Attached is the latest version of this patchset -- rebased in light of
> > > Thomas' updates to the streaming read API [1]. I have chosen the
> > > approach I think we should go with. It is a hybrid of my previously
> > > proposed approaches.
> >
> > While investigating some performance concerns, Andres pointed out that
> > the members I add to HeapScanDescData in this patch push rs_cindex and
> > rs_ntuples to another cacheline and introduce a 4-byte hole. Attached
> > v4's HeapScanDescData is as well-packed as on master and its members
> > are reordered so that rs_cindex and rs_ntuples are back on the second
> > cacheline of the struct's data.
>
> I did some additional profiling and realized that dropping the
> unlikely() from the places we check rs_inited frequently was negatively
> impacting performance. v5 adds those back and also makes a few other
> very minor cleanups.
>
> Note that this patch set has a not yet released version of Thomas
> Munro's Streaming Read API with a new ramp-up logic which seems to fix a
> performance issue I saw with my test case when all of the sequential
> scan's blocks are in shared buffers. Once he sends the official new
> version, I will rebase this and point to his explanation in that thread.

Attached v6 has the version of the streaming read API mentioned here
[1]. This resolved the fully-in-shared-buffers regressions
investigated in that thread by Andres, Bilal, and Thomas.

The one outstanding item for the sequential scan streaming read user
is deciding how the BAS_BULKREAD buffer access strategy should
interact with the streaming read infrastructure. We discussed a bit
off-list, and it seems clear that the ring must be at least as large
as io_combine_limit. This should be no problem for BAS_BULKREAD
because its ring is 16 MB. The question is whether or not we need to
do anything right now to ensure there aren't adverse interactions
between io_combine_limit, max_ios, and the buffer access strategy ring
buffer size.

- Melanie

[1] https://www.postgresql.org/message-id/CA%2BhUKGJTwrS7F%3DuJPx3SeigMiQiW%2BLJaOkjGyZdCntwyMR%3DuAw%40mail.gmail.com

Attachment

Re: Streaming read-ready sequential scan code

From: Melanie Plageman
On Wed, Mar 27, 2024 at 08:47:03PM -0400, Melanie Plageman wrote:
> On Fri, Mar 8, 2024 at 4:56 PM Melanie Plageman
> <melanieplageman@gmail.com> wrote:
> >
> > On Sat, Mar 02, 2024 at 06:07:48PM -0500, Melanie Plageman wrote:
> > > On Wed, Feb 28, 2024 at 12:30 PM Melanie Plageman
> > > <melanieplageman@gmail.com> wrote:
> > > >
> > > > On Mon, Feb 26, 2024 at 03:56:57PM -0500, Melanie Plageman wrote:
> > > > > On Mon, Feb 19, 2024 at 6:05 PM Melanie Plageman
> > > > > <melanieplageman@gmail.com> wrote:
> > > > > >
> > > > > > On Mon, Jan 29, 2024 at 4:17 PM Melanie Plageman
> > > > > > <melanieplageman@gmail.com> wrote:
> > > > > > >
> > > > > > > There is an outstanding question about where to allocate the
> > > > > > > PgStreamingRead object for sequential scans
> > > > > >
> > > > > > I've written three alternative implementations of the actual streaming
> > > > > > read user for sequential scan which handle the question of where to
> > > > > > allocate the streaming read object and how to handle changing scan
> > > > > > direction in different ways.
> > > > > >
> > > > > > Option A) https://github.com/melanieplageman/postgres/tree/seqscan_pgsr_initscan_allocation
> > > > > > - Allocates the streaming read object in initscan(). Since we do not
> > > > > > know the scan direction at this time, if the scan ends up not being a
> > > > > > forwards scan, the streaming read object must later be freed -- so
> > > > > > this will sometimes allocate a streaming read object it never uses.
> > > > > > - Only supports ForwardScanDirection and once the scan direction
> > > > > > changes, streaming is never supported again -- even if we return to
> > > > > > ForwardScanDirection
> > > > > > - Must maintain a "fallback" codepath that does not use the streaming read API
> > > > >
> > > > > Attached is a version of this patch which implements a "reset"
> > > > > function for the streaming read API which should be cheaper than the
> > > > > full pg_streaming_read_free() on rescan. This can easily be ported to
> > > > > work on any of my proposed implementations (A/B/C). I implemented it
> > > > > on A as an example.
> > > >
> > > > Attached is the latest version of this patchset -- rebased in light of
> > > > Thomas' updates to the streaming read API [1]. I have chosen the
> > > > approach I think we should go with. It is a hybrid of my previously
> > > > proposed approaches.
> > >
> > > While investigating some performance concerns, Andres pointed out that
> > > the members I add to HeapScanDescData in this patch push rs_cindex and
> > > rs_ntuples to another cacheline and introduce a 4-byte hole. Attached
> > > v4's HeapScanDescData is as well-packed as on master and its members
> > > are reordered so that rs_cindex and rs_ntuples are back on the second
> > > cacheline of the struct's data.
> >
> > I did some additional profiling and realized that dropping the
> > unlikely() from the places we check rs_inited frequently was negatively
> > impacting performance. v5 adds those back and also makes a few other
> > very minor cleanups.
> >
> > Note that this patch set has a not yet released version of Thomas
> > Munro's Streaming Read API with a new ramp-up logic which seems to fix a
> > performance issue I saw with my test case when all of the sequential
> > scan's blocks are in shared buffers. Once he sends the official new
> > version, I will rebase this and point to his explanation in that thread.
> 
> Attached v6 has the version of the streaming read API mentioned here
> [1]. This resolved the fully-in-shared-buffers regressions
> investigated in that thread by Andres, Bilal, and Thomas.

Attached v7 has version 14 of the streaming read API as well as a few
small tweaks to comments and code.

I noticed that 0001 in the set posed a small regression from master for
a sequential scan of a relation already in shared buffers. While
investigating this, I saw that heapfetchbuf() was still not being
inlined (compiled at -O2) and when I promoted heapfetchbuf() from static
inline to static pg_attribute_always_inline, most of the very small
regression I saw went away. I don't know if I squashed the issue
entirely, though.

- Melanie

Attachment

Re: Streaming read-ready sequential scan code

From: Heikki Linnakangas
On 01/04/2024 22:58, Melanie Plageman wrote:
> Attached v7 has version 14 of the streaming read API as well as a few
> small tweaks to comments and code.

I saw benchmarks in this thread to show that there's no regression when 
the data is in cache, but I didn't see any benchmarks demonstrating the 
benefit of this. So I ran this quick test:

-- create a ~1 GB table with only 1 row per page.
CREATE TABLE giga (i int, filler text) with (fillfactor=10);
insert into giga select g, repeat('x', 900) from generate_series(1, 140000) g;
vacuum freeze giga;

\timing on
select count(*) from giga;

The SELECT takes about 390 ms on 'master', and 230 ms with the patch.

This is pretty much the best case for this patch, real world gains will 
be much smaller. Nevertheless, nice speedup!

-- 
Heikki Linnakangas
Neon (https://neon.tech)




Re: Streaming read-ready sequential scan code

From: Melanie Plageman
On Tue, Apr 2, 2024 at 1:10 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>
> On 01/04/2024 22:58, Melanie Plageman wrote:
> > Attached v7 has version 14 of the streaming read API as well as a few
> > small tweaks to comments and code.
>
> I saw benchmarks in this thread to show that there's no regression when
> the data is in cache, but I didn't see any benchmarks demonstrating the
> benefit of this. So I ran this quick test:

Good point! It would be good to show why we would actually want this
patch. Attached v8 is rebased over current master (which now has the
streaming read API).

On the topic of BAS_BULKREAD buffer access strategy, I think the least
we could do is add an assert like this to read_stream_begin_relation()
after calculating max_pinned_buffers.

    Assert(GetAccessStrategyBufferCount(strategy) > max_pinned_buffers);

Perhaps we should do more? I think with a ring size of 16 MB, large
SELECTs are safe for now. But I know future developers will make
changes and it would be good not to assume they will understand that
pinning more buffers than the size of the ring effectively invalidates
the ring.

- Melanie

Attachment

Re: Streaming read-ready sequential scan code

From: Thomas Munro
On Thu, Apr 4, 2024 at 6:03 AM Melanie Plageman
<melanieplageman@gmail.com> wrote:
> On Tue, Apr 2, 2024 at 1:10 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> > On 01/04/2024 22:58, Melanie Plageman wrote:
> > > Attached v7 has version 14 of the streaming read API as well as a few
> > > small tweaks to comments and code.
> >
> > I saw benchmarks in this thread to show that there's no regression when
> > the data is in cache, but I didn't see any benchmarks demonstrating the
> > benefit of this. So I ran this quick test:
>
> Good point! It would be good to show why we would actually want this
> patch. Attached v8 is rebased over current master (which now has the
> streaming read API).

Anecdotally by all reports I've seen so far, all-in-cache seems to be
consistently a touch faster than master if anything, for streaming seq
scan and streaming ANALYZE.  That's great, but it doesn't seem to be
due to intentional changes.  No efficiency is coming from batching
anything.  Perhaps it has to do with CPU pipelining effects: though
it's doing the same work as ReadBuffer()-when-it-gets-a-hit, the work
itself is cut up into stages in a kind of pipeline:
read_stream_next_buffer() chooses page n + 2, pins page n + 1 and
returns page n each time you call it, so maybe we get more CPU
parallelism due to spreading the data dependencies out?  (Makes me
wonder what happens if you insert a memory prefetch for the page
header into that production line, and if there are more opportunities
to spread dependencies out, e.g. hashing the buffer tag ahead of time.)

> On the topic of BAS_BULKREAD buffer access strategy, I think the least
> we could do is add an assert like this to read_stream_begin_relation()
> after calculating max_pinned_buffers.
>
>     Assert(GetAccessStrategyBufferCount(strategy) > max_pinned_buffers);
>
> Perhaps we should do more? I think with a ring size of 16 MB, large
> SELECTs are safe for now. But I know future developers will make
> changes and it would be good not to assume they will understand that
> pinning more buffers than the size of the ring effectively invalidates
> the ring.

Yeah I think we should clamp max_pinned_buffers if we see a strategy.
What do you think about:

    if (strategy)
    {
        int strategy_buffers = GetAccessStrategyBufferCount(strategy);
        max_pinned_buffers = Min(strategy_buffers / 2, max_pinned_buffers);
    }

I just don't know where to get that '2'.  The idea would be to
hopefully never actually be constrained by it in practice, except in
tiny/toy setups, so we can't go too wrong with our number '2' there.

Then we should increase the default ring sizes for BAS_BULKREAD and
BAS_VACUUM to make the previous statement true.  The size of main
memory and L2 cache have increased dramatically since those strategies
were invented.  I think we should at least double them, and more
likely quadruple them.  I realise you already made them configurable
per command in commit 1cbbee033, but I mean the hard coded default 256
in freelist.c.  It's not only to get more space for our prefetching
plans, it's also to give the system more chance of flushing WAL
asynchronously/in some other backend before you crash into dirty data;
as you discovered, prefetching accidentally makes that effect worse
in a BAS_VACUUM strategy, by taking away space that is effectively
deferring WAL flushes, so I'd at least like to double the size if we
use "/ 2" above.



Re: Streaming read-ready sequential scan code

From: Melanie Plageman
On Wed, Apr 3, 2024 at 4:28 PM Thomas Munro <thomas.munro@gmail.com> wrote:
>
> On Thu, Apr 4, 2024 at 6:03 AM Melanie Plageman
> <melanieplageman@gmail.com> wrote:
> > On the topic of BAS_BULKREAD buffer access strategy, I think the least
> > we could do is add an assert like this to read_stream_begin_relation()
> > after calculating max_pinned_buffers.
> >
> >     Assert(GetAccessStrategyBufferCount(strategy) > max_pinned_buffers);
> >
> > Perhaps we should do more? I think with a ring size of 16 MB, large
> > SELECTs are safe for now. But I know future developers will make
> > changes and it would be good not to assume they will understand that
> > pinning more buffers than the size of the ring effectively invalidates
> > the ring.
>
> Yeah I think we should clamp max_pinned_buffers if we see a strategy.
> What do you think about:
>
>     if (strategy)
>     {
>         int strategy_buffers = GetAccessStrategyBufferCount(strategy);
>         max_pinned_buffers = Min(strategy_buffers / 2, max_pinned_buffers);
>     }
>
> I just don't know where to get that '2'.  The idea would be to
> hopefully never actually be constrained by it in practice, except in
> tiny/toy setups, so we can't go too wrong with our number '2' there.

Yea, I don't actually understand why not just use strategy_buffers - 1
or something. 1/2 seems like a big limiting factor for those
strategies with small rings.

I don't really think it will come up for SELECT queries since they
rely on readahead and not prefetching.
It does seem like it could easily come up for analyze.

But I am on board with the idea of clamping for sequential scan/TID
range scan. For vacuum, we might have to think harder. If the user
specifies buffer_usage_limit and io_combine_limit and they are never
reaching IOs of io_combine_limit because of their buffer_usage_limit
value, then we should probably inform them.

- Melanie



Re: Streaming read-ready sequential scan code

From: Andres Freund
Hi,

On 2024-04-04 09:27:35 +1300, Thomas Munro wrote:
> Anecdotally by all reports I've seen so far, all-in-cache seems to be
> consistently a touch faster than master if anything, for streaming seq
> scan and streaming ANALYZE.  That's great, but it doesn't seem to be
> due to intentional changes.  No efficiency is coming from batching
> anything.  Perhaps it has to do with CPU pipelining effects: though
> it's doing the same work as ReadBuffer()-when-it-gets-a-hit, the work
> itself is cut up into stages in a kind of pipeline:
> read_stream_next_buffer() chooses page n + 2, pins page n + 1 and
> returns page n each time you call it, so maybe we get more CPU
> parallelism due to spreading the data dependencies out?

Another theory is that it's due to the plain ReadBuffer() path needing to do
RelationGetSmgr(reln) on every call, whereas the streaming read path doesn't
need to.
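
For context, RelationGetSmgr() looks roughly like this (paraphrased
from src/include/utils/rel.h; the exact body varies by version), so
the plain path pays for the branch and pointer chase on every call:

    static inline SMgrRelation
    RelationGetSmgr(Relation rel)
    {
        if (unlikely(rel->rd_smgr == NULL))
            rel->rd_smgr = smgropen(rel->rd_locator, rel->rd_backend);
        return rel->rd_smgr;
    }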



> > On the topic of BAS_BULKREAD buffer access strategy, I think the least
> > we could do is add an assert like this to read_stream_begin_relation()
> > after calculating max_pinned_buffers.
> >
> >     Assert(GetAccessStrategyBufferCount(strategy) > max_pinned_buffers);
> >
> > Perhaps we should do more? I think with a ring size of 16 MB, large
> > SELECTs are safe for now. But I know future developers will make
> > changes and it would be good not to assume they will understand that
> > pinning more buffers than the size of the ring effectively invalidates
> > the ring.
>
> Yeah I think we should clamp max_pinned_buffers if we see a strategy.
> What do you think about:
>
>     if (strategy)
>     {
>         int strategy_buffers = GetAccessStrategyBufferCount(strategy);
>         max_pinned_buffers = Min(strategy_buffers / 2, max_pinned_buffers);
>     }

> I just don't know where to get that '2'.  The idea would be to
> hopefully never actually be constrained by it in practice, except in
> tiny/toy setups, so we can't go too wrong with our number '2' there.

The / 2 is to avoid causing unnecessarily frequent WAL flushes, right? If so,
should we apply that only if the strategy doesn't use the
StrategyRejectBuffer() logic?

I think it's fine to add a handwavy justification for the 2, saying that we
want to balance readahead distance and reducing WAL write frequency, and that
at some point more sophisticated logic will be needed.


> Then we should increase the default ring sizes for BAS_BULKREAD and
> BAS_VACUUM to make the previous statement true.

I'm not sure it's right tying them together. The concerns for both are fairly
different.


> The size of main memory and L2 cache have increased dramatically since those
> strategies were invented.  I think we should at least double them, and more
> likely quadruple them.  I realise you already made them configurable per
> command in commit 1cbbee033, but I mean the hard coded default 256 in
> freelist.c.  It's not only to get more space for our prefetching plans, it's
> also to give the system more chance of flushing WAL asynchronously/in some
> other backend before you crash into dirty data; as you discovered,
> prefetching accidentally makes that effect worse in a BAS_VACUUM strategy,
> by taking away space that is effectively deferring WAL flushes, so I'd at
> least like to double the size if we use "/ 2" above.

I think for VACUUM we should probably go a bit further. There's no comparable
L1/L2 issue, because the per-buffer processing + WAL insertion is a lot more
expensive, compared to a seqscan. I'd go for at least 4x-8x.

Greetings,

Andres Freund



Re: Streaming read-ready sequential scan code

From: David Rowley
On Thu, 4 Apr 2024 at 06:03, Melanie Plageman <melanieplageman@gmail.com> wrote:
> Attached v8 is rebased over current master (which now has the
> streaming read API).

I've looked at the v8-0001 patch.

I wasn't too keen on heapbuildvis() as a function name for an external
function.  Since it also does pruning work, it seemed weird to make it
sound like it only did visibility work.  Per our offline discussion
about names, I've changed it to what you suggested which is
heap_page_prep().

Aside from that, there was an outdated comment: "In pageatatime mode,
heapgetpage() already did visibility checks," which was no longer true
as that's done in heapbuildvis() (now heap_page_prep()).

I also did a round of comment adjustments as there were a few things I
didn't like, e.g:

+ * heapfetchbuf - subroutine for heapgettup()

This is also used in heapgettup_pagemode(), so I thought it was a bad
idea for a function to list places it thinks it's being called.   I
also thought it redundant to write "This routine" in the function head
comment. I think "this routine" is implied by the context. I ended up
with:

/*
 * heapfetchbuf - read and pin the given MAIN_FORKNUM block number.
 *
 * Read the specified block of the scan relation into a buffer and pin that
 * buffer before saving it in the scan descriptor.
 */

I'm happy with your changes to heapam_scan_sample_next_block(). I did
adjust the comment above CHECK_FOR_INTERRUPTS() so it was effectively
the same as the seqscan version, just with "seqscan" swapped for
"sample scan".

The only other thing I adjusted there was to use "blockno" in some
places where you were using hscan->rs_cblock.  These all come after
the "hscan->rs_cblock = blockno;" line. My thoughts here are that this
is more likely to avoid reading the value from the struct again if the
compiler isn't happy that the two values are still equivalent for some
reason.  Even if the compiler is happy today, it would only take a
code change to pass hscan to some external function for the compiler
to perhaps be uncertain if that function has made an adjustment to
rs_cblock and go with the safe option of pulling the value out of the
struct again, which is a little more expensive as it requires some
maths to figure out the offset.
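
A simplified illustration of that pattern (SomeExternalFunction() is
hypothetical):

    hscan->rs_cblock = blockno;

    SomeExternalFunction(hscan);    /* after this call the compiler may
                                     * no longer prove that
                                     * hscan->rs_cblock == blockno */

    /* "blockno" is likely still in a register; hscan->rs_cblock might
     * have to be reloaded (pointer plus offset arithmetic). */
    if (blockno >= hscan->rs_nblocks)
        blockno = 0;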

I've attached v9-0001 and a delta of just my changes from v8.

David

Attachment

Re: Streaming read-ready sequential scan code

From: Melanie Plageman
On Wed, Apr 3, 2024 at 9:08 PM David Rowley <dgrowleyml@gmail.com> wrote:
>
> On Thu, 4 Apr 2024 at 06:03, Melanie Plageman <melanieplageman@gmail.com> wrote:
> > Attached v8 is rebased over current master (which now has the
> > streaming read API).
>
> I've looked at the v8-0001 patch.

Thanks for taking a look!

> I wasn't too keen on heapbuildvis() as a function name for an external
> function.  Since it also does pruning work, it seemed weird to make it
> sound like it only did visibility work.  Per our offline discussion
> about names, I've changed it to what you suggested which is
> heap_page_prep().

Looking at it in the code, I am wondering if we should call
heap_page_prep() heap_scan_page_prep(). Just wondering if it is clear
that it is prepping a page to be scanned. You choose whatever you
think is best.

> Aside from that, there was an outdated comment: "In pageatatime mode,
> heapgetpage() already did visibility checks," which was no longer true
> as that's done in heapbuildvis() (now heap_page_prep()).
>
> I also did a round of comment adjustments as there were a few things I
> didn't like, e.g:
>
> + * heapfetchbuf - subroutine for heapgettup()
>
> This is also used in heapgettup_pagemode(), so I thought it was a bad
> idea for a function to list places it thinks it's being called.   I
> also thought it redundant to write "This routine" in the function head
> comment. I think "this routine" is implied by the context. I ended up
> with:
>
> /*
>  * heapfetchbuf - read and pin the given MAIN_FORKNUM block number.
>  *
>  * Read the specified block of the scan relation into a buffer and pin that
>  * buffer before saving it in the scan descriptor.
>  */
>
> I'm happy with your changes to heapam_scan_sample_next_block(). I did
> adjust the comment above CHECK_FOR_INTERRUPTS() so it was effectively
> the same as the seqscan version, just with "seqscan" swapped for
> "sample scan".

That all is fine with me.

> The only other thing I adjusted there was to use "blockno" in some
> places where you were using hscan->rs_cblock.  These all come after
> the "hscan->rs_cblock = blockno;" line. My thoughts here are that this
> is more likely to avoid reading the value from the struct again if the
> compiler isn't happy that the two values are still equivalent for some
> reason.  Even if the compiler is happy today, it would only take a
> code change to pass hscan to some external function for the compiler
> to perhaps be uncertain if that function has made an adjustment to
> rs_cblock and go with the safe option of pulling the value out the
> struct again which is a little more expensive as it requires some
> maths to figure out the offset.
>
> I've attached v9-0001 and a delta of just my changes from v8.

All sounds good and LGTM

- Melanie



Re: Streaming read-ready sequential scan code

From: David Rowley
On Thu, 4 Apr 2024 at 14:38, Melanie Plageman <melanieplageman@gmail.com> wrote:
> Looking at it in the code, I am wondering if we should call
> heap_page_prep() heap_scan_page_prep(). Just wondering if it is clear
> that it is prepping a page to be scanned. You choose whatever you
> think is best.

I ended up calling it heap_prepare_pagescan() as I started to think
prep/prepare should come first. I don't think it's perfect as the
intended meaning is heap_prepare_page_for_scanning_in_pagemode(), but
that's obviously too long.

I've pushed the v9-0001 with that rename done.

David



Re: Streaming read-ready sequential scan code

From: David Rowley
On Thu, 4 Apr 2024 at 16:45, David Rowley <dgrowleyml@gmail.com> wrote:
> I've pushed the v9-0001 with that rename done.

I've now just pushed the 0002 patch with some revisions:

1. The function declarations you added for heapgettup_advance_block()
and heapgettup_initial_block() didn't match the properties of their
definitions.  You'd declared both of these static inline but neither
of these were.
2. I felt inclined to rename heapfetchbuf() to heapfetchnextbuf() as
that's effectively what it does with v8-0002; however, that's just too
many words all shoved together, so I renamed it to
heap_fetch_next_buffer().
3. I changed heapgettup_initial_block() to pg_noinline as it both
makes more sense to have this out of line and it also fixed a small
performance regression.

Looks like I also failed to grep for all the remaining instances of
"heapgetpage" in 44086b097.  Those are now fixed by 3a4a3537a.

I also rebased the 0003 patch which I've attached as a raw patch.

FWIW, using Heikki's test in [1] with a pg_prewarm each time after
restarting the instance. No parallel aggregate.

drowley@amd3990x:~$ cat bench.sql
 select count(*) from giga;

drowley@amd3990x:~$ pgbench -n -f bench.sql -T 120 postgres | grep latency

44086b097~1
latency average = 34.323 ms
latency average = 34.332 ms

44086b097
latency average = 34.811 ms
latency average = 34.862 ms

3a4a3537a
latency average = 34.497 ms
latency average = 34.538 ms

3a4a3537a + read_stream_for_seqscans.patch
latency average = 40.923 ms
latency average = 41.415 ms

i.e. no meaningful change from the refactor, but a regression from a
cached workload that changes the page often without doing much work in
between with the read stream patch.

I'm happy to run further benchmarks, but for the remainder of the
committer responsibility for the next patches, I'm going to leave that
to Thomas.

David

[1] https://www.postgresql.org/message-id/3b0f3701-addd-4629-9257-cf28e1a6e6a1@iki.fi

Attachment

Re: Streaming read-ready sequential scan code

From: Thomas Munro
On Thu, Apr 4, 2024 at 11:13 AM Andres Freund <andres@anarazel.de> wrote:
> The / 2 is to avoid causing unnecessarily frequent WAL flushes, right? If so,
> should we apply that only if the strategy doesn't use the
> StrategyRejectBuffer() logic?

Hmm, I don't really know, but that sounds plausible.  What do you
think about the attached?

> I think for VACUUM we should probably go a bit further. There's no comparable
> L1/L2 issue, because the per-buffer processing + WAL insertion is a lot more
> expensive, compared to a seqscan. I'd go or at lest 4x-8x.

Alright what about this?

Attachment

Re: Streaming read-ready sequential scan code

From
Thomas Munro
Date:
On Thu, Apr 4, 2024 at 10:31 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> Alright what about this?

Forgot to git add a change, new version.

Attachment

Re: Streaming read-ready sequential scan code

From: Thomas Munro
On Thu, Apr 4, 2024 at 8:02 PM David Rowley <dgrowleyml@gmail.com> wrote:
> 3a4a3537a
> latency average = 34.497 ms
> latency average = 34.538 ms
>
> 3a4a3537a + read_stream_for_seqscans.patch
> latency average = 40.923 ms
> latency average = 41.415 ms
>
> i.e. no meaningful change from the refactor, but a regression from a
> cached workload that changes the page often without doing much work in
> between with the read stream patch.

I ran Heikki's test except I ran the "insert" 4 times to get a table
of 4376MB according to \d+.  On my random cloud ARM server (SB=8GB,
huge pages, parallelism disabled), I see a speedup 1290ms -> 1046ms
when the data is in Linux cache and PG is not prewarmed, roughly as he
reported.  Good.

If I pg_prewarm first, I see that slowdown 260ms -> 294ms.  Trying
things out to see what works, I got that down to 243ms (ie beat
master) by inserting a memory prefetch:

--- a/src/backend/storage/aio/read_stream.c
+++ b/src/backend/storage/aio/read_stream.c
@@ -757,6 +757,8 @@ read_stream_next_buffer(ReadStream *stream, void
**per_buffer_data)
        /* Prepare for the next call. */
        read_stream_look_ahead(stream, false);

+       __builtin_prefetch(BufferGetPage(stream->buffers[stream->oldest_buffer_index]));

Maybe that's a solution to a different problem that just happens to
more than make up the difference in this case, and it may be
questionable whether that cache line will survive long enough to help
you, but this one-tuple-per-page test likes it...  Hmm, to get a more
realistic table than the one-tuple-per-page one, I tried doubling a
tenk1 table until it reached 2759MB and then I got a post-prewarm
regression 702ms -> 721ms, and again I can beat master by memory
prefetching: 689ms.

Annoyingly, with the same table I see no difference between the actual
pg_prewarm('giga') time: around 155ms for both.  pg_prewarm is able to
use the 'fast path' I made pretty much just to be able to minimise
regression in that (probably stupid) all-cached test that doesn't even
look at the page contents.  Unfortunately seq scan can't use it
because it has per-buffer data, which is one of the things it can't do
(because of a space management issue).  Maybe I should try to find a
way to fix that.

> I'm happy to run further benchmarks, but for the remainder of the
> committer responsibility for the next patches, I'm going to leave that
> to Thomas.

Thanks!



Re: Streaming read-ready sequential scan code

From: Melanie Plageman
On Thu, Apr 4, 2024 at 3:02 AM David Rowley <dgrowleyml@gmail.com> wrote:
>
> On Thu, 4 Apr 2024 at 16:45, David Rowley <dgrowleyml@gmail.com> wrote:
> > I've pushed the v9-0001 with that rename done.
>
> I've now just pushed the 0002 patch with some revisions:

Thanks!

> 1. The function declarations you added for heapgettup_advance_block()
> and heapgettup_initial_block() didn't match the properties of their
> definitions.  You'd declared both of these static inline but neither
> of these were.

Ah, they needed to be defined static but I intentionally left off the
inline in the definition and only put it in the forward declaration
because I thought that was correct.  Anyway, I'm fine with how you did
it in the end.

> 2. I felt inclined to rename heapfetchbuf() to heapfetchnextbuf() as
> that's effectively what it does with v8-0002; however, that's just too
> many words all shoved together, so I renamed it to
> heap_fetch_next_buffer().

Sounds good.

> 3. I changed heapgettup_initial_block() to pg_noinline as it both
> makes more sense to have this out of line and it also fixed a small
> performance regression.

Ah, I guess it is in the unlikely path. I often forget that noinline
exists. It's interesting that that made a noticeable difference since
it is a pretty short function. Thanks for taking such a close look!

> Looks like I also failed to grep for all the remaining instances of
> "heapgetpage" in 44086b097.  Those are now fixed by 3a4a3537a.
>
> I also rebased the 0003 patch which I've attached as a raw patch.

Thanks!

> FWIW, using Heikki's test in [1] with a pg_prewarm each time after
> restarting the instance. No parallel aggregate.
>
> drowley@amd3990x:~$ cat bench.sql
>  select count(*) from giga;
>
> drowley@amd3990x:~$ pgbench -n -f bench.sql -T 120 postgres | grep latency
>
> 44086b097~1
> latency average = 34.323 ms
> latency average = 34.332 ms
>
> 44086b097
> latency average = 34.811 ms
> latency average = 34.862 ms
>
> 3a4a3537a
> latency average = 34.497 ms
> latency average = 34.538 ms
>
> 3a4a3537a + read_stream_for_seqscans.patch
> latency average = 40.923 ms
> latency average = 41.415 ms
>
> i.e. no meaningful change from the refactor, but a regression from a
> cached workload that changes the page often without doing much work in
> between with the read stream patch.

Cool. Thanks for testing this out. Sounds like Thomas did some
analysis of how to resolve this for the streaming read user, and I
will do some too.

- Melanie



Re: Streaming read-ready sequential scan code

From
Melanie Plageman
Date:
On Thu, Apr 4, 2024 at 7:45 AM Thomas Munro <thomas.munro@gmail.com> wrote:
>
> On Thu, Apr 4, 2024 at 8:02 PM David Rowley <dgrowleyml@gmail.com> wrote:
> > 3a4a3537a
> > latency average = 34.497 ms
> > latency average = 34.538 ms
> >
> > 3a4a3537a + read_stream_for_seqscans.patch
> > latency average = 40.923 ms
> > latency average = 41.415 ms
> >
> > i.e. no meaningful change from the refactor, but a regression from a
> > cached workload that changes the page often without doing much work in
> > between with the read stread patch.
>
> I ran Heikki's test except I ran the "insert" 4 times to get a table
> of 4376MB according to \d+.  On my random cloud ARM server (SB=8GB,
> huge pages, parallelism disabled), I see a speedup 1290ms -> 1046ms
> when the data is in Linux cache and PG is not prewarmed, roughly as he
> reported.  Good.
>
> If I pg_prewarm first, I see that slowdown 260ms -> 294ms.  Trying
> things out to see what works, I got that down to 243ms (ie beat
> master) by inserting a memory prefetch:
>
> --- a/src/backend/storage/aio/read_stream.c
> +++ b/src/backend/storage/aio/read_stream.c
> @@ -757,6 +757,8 @@ read_stream_next_buffer(ReadStream *stream, void
> **per_buffer_data)
>         /* Prepare for the next call. */
>         read_stream_look_ahead(stream, false);
>
> +       __builtin_prefetch(BufferGetPage(stream->buffers[stream->oldest_buffer_index]));
>
> Maybe that's a solution to a different problem that just happens to
> more than make up the difference in this case, and it may be
> questionable whether that cache line will survive long enough to help
> you, but this one-tuple-per-page test likes it...  Hmm, to get a more
> realistic table than the one-tuple-per-page one, I tried doubling a
> tenk1 table until it reached 2759MB and then I got a post-prewarm
> regression 702ms -> 721ms, and again I can beat master by memory
> prefetching: 689ms.
>
> Annoyingly, with the same table I see no difference in the actual
> pg_prewarm('giga') time: around 155ms for both.  pg_prewarm is able to
> use the 'fast path' I made pretty much just to be able to minimise
> regression in that (probably stupid) all-cached test that doesn't even
> look at the page contents.  Unfortunately seq scan can't use it
> because it has per-buffer data, which is one of the things it can't do
> (because of a space management issue).  Maybe I should try to find a
> way to fix that.

So, sequential scan does not have per-buffer data. I did some logging,
and the reason most fully-in-SB sequential scans don't use the fast
path is that read_stream->pending_read_nblocks is always 0.

When read_stream->distance stays 1 (expected for the all-in-SB case,
since distance is initialized to 1 and we don't want it to ramp up),
read_stream_look_ahead() never increments
read_stream->pending_read_nblocks: it sets it to 1 the first time it
is called, and then the conditions of the while loop are never met
again:

    while (stream->ios_in_progress < stream->max_ios &&
           stream->pinned_buffers + stream->pending_read_nblocks <
stream->distance)

distance is 1, pending_read_nblocks is 1, thus we only loop once and
don't increment pending_read_nblocks.

prewarm is only able to use the fast path because it passes
READ_STREAM_FULL and thus initializes read_stream->distance to a
higher value.
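
To make that concrete, here is a hand-traced sketch of the loop, with
state values inferred from the description above rather than taken
from a debugger:

    distance = 1 (the all-in-SB case):
      pass 1: pinned_buffers (0) + pending_read_nblocks (0) < 1 -> loop;
              pending_read_nblocks becomes 1
      pass 2: pinned_buffers (0) + pending_read_nblocks (1) < 1 -> false;
              exit without ever batching more than one block

    distance > 1 (e.g. pg_prewarm via READ_STREAM_FULL): the condition
    stays true across several passes, so pending_read_nblocks can grow
    and the multi-block fast path becomes reachable.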

I added some logging to see if any of the sequential scans in the
regression suite used the fast path. The one example I see of the fast
path being used is a temp table IO stats test in
src/test/regress/sql/stats.sql. I didn't check exactly what conditions
led it to do this. But we probably want seq scans which are all in SB
to use the fast path.

- Melanie



Re: Streaming read-ready sequential scan code

From
Andres Freund
Date:
Hi,

On 2024-04-04 22:37:39 +1300, Thomas Munro wrote:
> On Thu, Apr 4, 2024 at 10:31 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> > Alright what about this?

I think it's probably worth adding a bit more of the commit message to the
function comment. Yes, there's a bit in one of the return branches, but that's
not what you're going to look at when just checking what the function does.


> From e610bc78a2e3ecee50bd897e35084469d00bbac5 Mon Sep 17 00:00:00 2001
> From: Thomas Munro <thomas.munro@gmail.com>
> Date: Thu, 4 Apr 2024 21:11:06 +1300
> Subject: [PATCH v2 2/2] Increase default vacuum_buffer_usage_limit to 2MB.
> 
> The BAS_VACUUM ring size has been 256kB since commit d526575f.  Commit
> 1cbbee03 made it configurable but retained the traditional default.
> The correct default size has been debated for years, but 256kB is
> certainly very small.  VACUUM soon needs to write back data it dirtied
> only 32 blocks ago, which usually requires flushing the WAL.  New
> experiments in prefetching pages for VACUUM exacerbated the problem by
> crashing into dirty data even sooner.  Let's make the default 2MB.
> That's 1.5% of the default toy buffer pool size, and 0.2% of 1GB, which
> would be considered a small shared_buffers setting for a real system
> these days.  Users are still free to set the GUC to a different value.

+1.  Independent of any other changes, this improves the default vacuum
performance substantially. We might want to dynamically size the default at
some point - but we probably should overhaul the infrastructure first...


Greetings,

Andres Freund



Re: Streaming read-ready sequential scan code

From
Thomas Munro
Date:
On Fri, Apr 5, 2024 at 4:20 AM Melanie Plageman
<melanieplageman@gmail.com> wrote:
> So, sequential scan does not have per-buffer data. I did some logging
> and the reason most fully-in-SB sequential scans don't use the fast
> path is because read_stream->pending_read_nblocks is always 0.

Hnghghghgh... right, sorry I guessed the wrong reason, it turns out
that I made a fast path just a little too specialised for pg_prewarm.
Working on it...



Re: Streaming read-ready sequential scan code

From
Thomas Munro
Date:
Yeah, I plead benchmarking myopia, sorry.  The fastpath as committed
is only reached when distance goes 2->1, as pg_prewarm does.  Oops.
With the attached minor rearrangement, it works fine.  I also poked
some more at that memory prefetcher.  Here are the numbers I got on a
desktop system (Intel i9-9900 @ 3.1GHz, Linux 6.1, turbo disabled,
cpufreq governor=performance, 2MB huge pages, SB=8GB, consumer NVMe,
GCC -O3).

create table t (i int, filler text) with (fillfactor=10);
insert into t
select g, repeat('x', 900) from generate_series(1, 560000) g;
vacuum freeze t;
set max_parallel_workers_per_gather = 0;

select count(*) from t;

cold = must be read from actual disk (Linux drop_caches)
warm = read from linux page cache
hot = already in pg cache via pg_prewarm

                                    cold   warm    hot
master                            2479ms  886ms  200ms
seqscan                           2498ms  716ms  211ms <-- regression
seqscan + fastpath                2493ms  711ms  200ms <-- fixed, I think?
seqscan + memprefetch             2499ms  716ms  182ms
seqscan + fastpath + memprefetch  2505ms  710ms  170ms <-- \O/

Cold has no difference.  That's just my disk demonstrating Linux RA at
128kB (default); random I/O is obviously a more interesting story.
It's consistently a smidgen faster with Linux RA set to 2MB (as in
blockdev --setra 4096 /dev/nvmeXXX), and I believe this effect
probably also increases on fancier faster storage than what I have on
hand:

                                    cold
master                            1775ms
seqscan + fastpath + memprefetch  1700ms

Warm is faster as expected (fewer system calls schlepping data
kernel->userspace).

The interesting column is hot.  The 200ms->211ms regression is due to
the extra bookkeeping in the slow path.  The rejiggered fastpath code
fixes it for me, or maybe sometimes shows an extra 1ms.  Phew.  Can
you reproduce that?

The memory prefetching trick, on top of that, seems to be a good
optimisation so far.  Note that that's not an entirely independent
trick, it's something we can only do now that we can see into the
future; it's the next level up of prefetching, worth doing around 60ns
before you need the data I guess.  Who knows how thrashed the cache
might be before the caller gets around to accessing that page, but
there doesn't seem to be much of a cost or downside to this bet.  We
know there are many more opportunities like that[1] but I don't want
to second-guess the AM here, I'm just betting that the caller is going
to look at the header.

Unfortunately there seems to be a subtle bug hiding somewhere in here,
visible on macOS on CI.  Looking into that, going to find my Mac...

[1] https://www.postgresql.org/message-id/flat/CAApHDvpTRx7hqFZGiZJ%3Dd9JN4h1tzJ2%3Dxt7bM-9XRmpVj63psQ%40mail.gmail.com

Attachment

Re: Streaming read-ready sequential scan code

From
Melanie Plageman
Date:
On Thu, Apr 4, 2024 at 12:39 PM Andres Freund <andres@anarazel.de> wrote:
>
> On 2024-04-04 22:37:39 +1300, Thomas Munro wrote:
> > On Thu, Apr 4, 2024 at 10:31 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> > > Alright what about this?
>
> I think it's probably worth adding a bit more of the commit message to the
> function comment. Yes, there's a bit in one of the return branches, but that's
> not what you're going to look at when just checking what the function does.

Agreed about the comment. I kept thinking that BAS_BULKREAD should
maybe return nbuffers - 1, but I couldn't convince myself why.
Otherwise v2-0001-Allow-BufferAccessStrategy-to-limit-pin-count LGTM.

- Melanie



Re: Streaming read-ready sequential scan code

From
Melanie Plageman
Date:
On Fri, Apr 5, 2024 at 12:15 AM Thomas Munro <thomas.munro@gmail.com> wrote:
>
> Yeah, I plead benchmarking myopia, sorry.  The fastpath as committed
> is only reached when distance goes 2->1, as pg_prewarm does.  Oops.
> With the attached minor rearrangement, it works fine.  I also poked
> some more at that memory prefetcher.  Here are the numbers I got on a
> desktop system (Intel i9-9900 @ 3.1GHz, Linux 6.1, turbo disabled,
> cpufreq governor=performance, 2MB huge pages, SB=8GB, consumer NVMe,
> GCC -O3).
>
> create table t (i int, filler text) with (fillfactor=10);
> insert into t
> select g, repeat('x', 900) from generate_series(1, 560000) g;
> vacuum freeze t;
> set max_parallel_workers_per_gather = 0;
>
> select count(*) from t;
>
> cold = must be read from actual disk (Linux drop_caches)
> warm = read from linux page cache
> hot = already in pg cache via pg_prewarm
>
>                                     cold   warm    hot
> master                            2479ms  886ms  200ms
> seqscan                           2498ms  716ms  211ms <-- regression
> seqscan + fastpath                2493ms  711ms  200ms <-- fixed, I think?
> seqscan + memprefetch             2499ms  716ms  182ms
> seqscan + fastpath + memprefetch  2505ms  710ms  170ms <-- \O/
>
> Cold has no difference.  That's just my disk demonstrating Linux RA at
> 128kB (default); random I/O is obviously a more interesting story.
> It's consistently a smidgen faster with Linux RA set to 2MB (as in
> blockdev --setra 4096 /dev/nvmeXXX), and I believe this effect
> probably also increases on fancier faster storage than what I have on
> hand:
>
>                                     cold
> master                            1775ms
> seqscan + fastpath + memprefetch  1700ms
>
> Warm is faster as expected (fewer system calls schlepping data
> kernel->userspace).
>
> The interesting column is hot.  The 200ms->211ms regression is due to
> the extra bookkeeping in the slow path.  The rejiggered fastpath code
> fixes it for me, or maybe sometimes shows an extra 1ms.  Phew.  Can
> you reproduce that?

I am able to reproduce the fast path solving the issue using Heikki's
example here [1] but in shared buffers (hot).

master:                    25 ms
stream read:               29 ms
stream read + fast path:   25 ms

I haven't looked into or reviewed the memory prefetching part.

While reviewing 0002, I realized that I don't quite see how
read_stream_get_block() will be used in the fastpath -- which it
claims in its comments.
read_stream_next_buffer() is the only caller of
read_stream_look_ahead()->read_stream_get_block(), and if fast_path is
true, read_stream_next_buffer() always returns before calling
read_stream_look_ahead(). Maybe I am missing something. I see
fast_path uses read_stream_fill_blocknums() to invoke the callback.

Oh, and why does the READ_STREAM_DISABLE_FAST_PATH macro exist?

Otherwise 0002 looks good to me.

I haven't reviewed 0003 or 0004. I attached a new version (v11)
because I noticed an outdated comment in my seq scan streaming read
user patch (0001). The other patches in the set are untouched from
your versions, apart from added author/reviewer info in the commit
message for 0002.

- Melanie

[1] https://www.postgresql.org/message-id/3b0f3701-addd-4629-9257-cf28e1a6e6a1%40iki.fi

Attachment

Re: Streaming read-ready sequential scan code

From
Thomas Munro
Date:
On Sat, Apr 6, 2024 at 6:55 AM Melanie Plageman
<melanieplageman@gmail.com> wrote:
> On Fri, Apr 5, 2024 at 12:15 AM Thomas Munro <thomas.munro@gmail.com> wrote:
> > The interesting column is hot.  The 200ms->211ms regression is due to
> > the extra bookkeeping in the slow path.  The rejiggered fastpath code
> > fixes it for me, or maybe sometimes shows an extra 1ms.  Phew.  Can
> > you reproduce that?
>
> I am able to reproduce the fast path solving the issue using Heikki's
> example here [1] but in shared buffers (hot).
>
> master:                    25 ms
> stream read:               29 ms
> stream read + fast path:   25 ms

Great, thanks.

> I haven't looked into or reviewed the memory prefetching part.
>
> While reviewing 0002, I realized that I don't quite see how
> read_stream_get_block() will be used in the fastpath -- which it
> claims in its comments.

Comments improved.

> Oh, and why does the READ_STREAM_DISABLE_FAST_PATH macro exist?

Just for testing purposes.  Behaviour should be identical to external
observers either way, it's just a hand-rolled specialisation for
certain parameters, and it's useful to be able to verify that and
measure the speedup.  I think it's OK to leave a bit of
developer/testing scaffolding in the tree if it's likely to be useful
again and especially if like this case it doesn't create any dead
code.  (Perhaps in later work we might find the right way to get the
compiler to do the specialisation?  It's so simple though.)
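
For anyone reading along, the gating pattern is roughly this (a
paraphrase of read_stream.c rather than a verbatim excerpt):

#ifndef READ_STREAM_DISABLE_FAST_PATH
        if (likely(stream->fast_path))
        {
                /* the hand-rolled specialisation for the all-cached,
                 * no-per-buffer-data case lives here; defining the
                 * macro compiles it out, so every call takes the
                 * general-purpose path below */
        }
#endif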

The occasional CI problem I mentioned turned out to be
read_stream_reset() remembering a little too much state across
rescans.  Fixed.

Thanks both for the feedback on the ring buffer tweaks.  Comments
updated.  Here is the full stack of patches I would like to commit
very soon, though I may leave the memory prefetching part out for a
bit longer to see if I can find any downside.

Attachment

Re: Streaming read-ready sequential scan code

From
Melanie Plageman
Date:
On Fri, Apr 5, 2024 at 7:28 PM Thomas Munro <thomas.munro@gmail.com> wrote:
>
> On Sat, Apr 6, 2024 at 6:55 AM Melanie Plageman
> <melanieplageman@gmail.com> wrote:
> > I haven't looked into or reviewed the memory prefetching part.
> >
> > While reviewing 0002, I realized that I don't quite see how
> > read_stream_get_block() will be used in the fastpath -- which it
> > claims in its comments.
>
> Comments improved.

Ah, makes sense now.

> The occasional CI problem I mentioned turned out to be
> read_stream_reset() remembering a little too much state across
> rescans.  Fixed.

Nice investigative work figuring this out.

> Thanks both for the feedback on the ring buffer tweaks.  Comments
> updated.  Here is the full stack of patches I would like to commit
> very soon, though I may leave the memory prefetching part out for a
> bit longer to see if I can find any downside.

0001 LGTM. I did not review 0002 or 0003.

0004 looks good except for one comment typo:

           /*
            * Tell call not to pin more than half the buffers in the ring.
            * This is a trade-off between look ahead distance and deferring
            * writeback and associated WAL traffic.
            */

call -> caller

0006, I noticed the commit message is missing an important comma:

Instead of calling ReadBuffer() for each block heap sequential scans and
TID range scans now use the streaming read API introduced in b5a9b18cd0.

should be

Instead of calling ReadBuffer() for each block, heap sequential scans and
TID range scans now use the streaming read API introduced in b5a9b18cd0.

- Melanie



Re: Streaming read-ready sequential scan code

From
Thomas Munro
Date:
I found a bug in read_stream.c that could be hit with Melanie's
streaming seq scan patch with parallelism enabled and certain buffer
pool conditions.  Short version: there is an edge case where an "if"
needed to be a "while", or we could lose a few blocks.  Here's the fix
for that; longer explanation in the commit message.

Attachment

Re: Streaming read-ready sequential scan code

From
Melanie Plageman
Date:
On Sat, Apr 6, 2024 at 9:25 PM Thomas Munro <thomas.munro@gmail.com> wrote:
>
> I found a bug in read_stream.c that could be hit with Melanie's
> streaming seq scan patch with parallelism enabled and certain buffer
> pool conditions.  Short version: there is an edge case where an "if"
> needed to be a "while", or we could lose a few blocks.  Here's the fix
> for that; longer explanation in the commit message.

Attached v13 0001 is your fix and 0002 is a new version of the
sequential scan streaming read user. Off-list Andres mentioned that I
really ought to separate the parallel and serial sequential scan users
into two different callbacks. I've done that in the attached. It
actually makes the code used by the callbacks nicer and more readable
anyway (putting aside performance). I was able to measure a small
performance difference as well.

I've also added a few comments and improved existing comments.

- Melanie

Attachment

Re: Streaming read-ready sequential scan code

From
Thomas Munro
Date:
On Sun, Apr 7, 2024 at 1:34 PM Melanie Plageman
<melanieplageman@gmail.com> wrote:
> Attached v13 0001 is your fix and 0002 is a new version of the
> sequential scan streaming read user. Off-list Andres mentioned that I
> really ought to separate the parallel and serial sequential scan users
> into two different callbacks. I've done that in the attached. It
> actually makes the code used by the callbacks nicer and more readable
> anyway (putting aside performance). I was able to measure a small
> performance difference as well.

Thanks.  I changed a couple of very trivial things before pushing.

+        BlockNumber (*cb) (ReadStream *pgsr, void *private_data,
+                           void *per_buffer_data);

This type has a friendly name: ReadStreamBlockNumberCB.
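
For reference, the typedef in read_stream.h has this shape
(reconstructed here from the names used elsewhere in this thread):

typedef BlockNumber (*ReadStreamBlockNumberCB) (ReadStream *stream,
                                                void *callback_private_data,
                                                void *per_buffer_data);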

+        scan->rs_read_stream =
read_stream_begin_relation(READ_STREAM_SEQUENTIAL,

I've been on the fence about that flag for sequential scan...  Some
days I want to consider changing to READ_STREAM_DEFAULT and relying on
our "anti-heuristics" to suppress advice, which would work out the
same in most cases but might occasionally win big.  It might also
hurt, I dunno, so I suspect we'd have to make it better first, which
my patch in [1] was a first swing at, but I haven't researched that
enough.  So, kept this way!

- * Read the next block of the scan relation into a buffer and pin that buffer
- * before saving it in the scan descriptor.
+ * Read the next block of the scan relation from the read stream and pin that
+ * buffer before saving it in the scan descriptor.

Changed to:

 * Read the next block of the scan relation from the read stream and save it
 * in the scan descriptor.  It is already pinned.

+static BlockNumber
+heap_scan_stream_read_next_parallel(ReadStream *pgsr, void *private_data,
+                                    void *per_buffer_data)

Changed argument names to match the function pointer type definition,
"stream" and "callback_private_data".

BTW looking at the branching in read-stream user patches that have an
initialisation step like yours, I wonder if it might ever make sense
to be able to change the callback on the fly from inside the callback,
so that you finish up with a branchless one doing most of the work.  I
have no idea if it's worth it...

[1] https://www.postgresql.org/message-id/CA%2BhUKGLLFvou5rx5FDhm-Pc9r4STQTFFmrx6SUV%2Bvk8fwMbreA%40mail.gmail.com



Re: Streaming read-ready sequential scan code

From
Andres Freund
Date:
Hi,

On 2024-04-08 09:36:59 +1200, Thomas Munro wrote:
> I've been on the fence about that flag for sequential scan...  Some
> days I want to consider changing to READ_STREAM_DEFAULT and relying on
> our "anti-heuristics" to suppress advice, which would work out the
> same in most cases but might occasionally win big.

Agreed, it's pretty easy to end up with a fairly "fragmented" set of a
relation's buffers in s_b.  OTOH, there might not be any need for the
heuristic if we actually trigger reads asynchronously.


> BTW looking at the branching in read-stream user patches that have an
> initialisation step like yours, I wonder if it might ever make sense
> to be able to change the callback on the fly from inside the callback,
> so that you finish up with a branchless one doing most of the work.  I
> have no idea if it's worth it...

I was wondering about that too; I dislike those branches. But instead of
changing the callback, it seems like a better design would be to have another
dedicated callback for that?  There already is a dedicated branch for the
"just starting up" path in read_stream_next_buffer(), so it'd be pretty much
free to call another callback there.
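
A minimal sketch of what that might look like at stream creation time
(purely illustrative: the extra parameter and both callback names are
made up, and no such API exists today):

    stream = read_stream_begin_relation(READ_STREAM_SEQUENTIAL,
                                        bstrategy, rel, MAIN_FORKNUM,
                                        seq_scan_first_block_cb, /* one-shot startup callback */
                                        seq_scan_next_block_cb,  /* steady-state callback */
                                        scan, 0);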

Greetings,

Andres Freund



Re: Streaming read-ready sequential scan code

From
Alexander Lakhin
Date:
Hello,

I decided to compare v17 vs v16 performance (as I did the last year [1])
and discovered that v17 loses to v16 in the pg_tpcds (s64da_tpcds)
benchmark, query15 (and several others, but I focused on this one):
Best pg-src-master--.* worse than pg-src-16--.* by 52.2 percents (229.84 > 151.03): pg_tpcds.query15
Average pg-src-master--.* worse than pg-src-16--.* by 53.4 percents (234.20 > 152.64): pg_tpcds.query15
Please look at the full html report attached in case you're interested.

(I used my pg-mark tool to measure/analyze performance, but I believe the
same results can be seen without it.)

`git bisect` for this performance degradation pointed at b7b0f3f27...

[1] https://www.postgresql.org/message-id/b32bed1b-0746-9b20-1472-4bdc9ca66d52%40gmail.com

Best regards,
Alexander
Attachment

Re: Streaming read-ready sequential scan code

From
Thomas Munro
Date:
On Sat, May 18, 2024 at 1:00 AM Alexander Lakhin <exclusion@gmail.com> wrote:
> I decided to compare v17 vs v16 performance (as I did the last year [1])
> and discovered that v17 loses to v16 in the pg_tpcds (s64da_tpcds)
> benchmark, query15 (and several others, but I focused on this one):
> Best pg-src-master--.* worse than pg-src-16--.* by 52.2 percents (229.84 > 151.03): pg_tpcds.query15
> Average pg-src-master--.* worse than pg-src-16--.* by 53.4 percents (234.20 > 152.64): pg_tpcds.query15
> Please look at the full html report attached in case you're interested.
>
> (I used my pg-mark tool to measure/analyze performance, but I believe the
> same results can be seen without it.)

Will investigate, but if it's easy for you to rerun, does it help if
you increase Linux readahead, eg blockdev --setra setting?



Re: Streaming read-ready sequential scan code

From
Thomas Munro
Date:
On Sat, May 18, 2024 at 8:09 AM Thomas Munro <thomas.munro@gmail.com> wrote:
> On Sat, May 18, 2024 at 1:00 AM Alexander Lakhin <exclusion@gmail.com> wrote:
> > I decided to compare v17 vs v16 performance (as I did the last year [1])
> > and discovered that v17 loses to v16 in the pg_tpcds (s64da_tpcds)
> > benchmark, query15 (and several others, but I focused on this one):
> > Best pg-src-master--.* worse than pg-src-16--.* by 52.2 percents (229.84 > 151.03): pg_tpcds.query15
> > Average pg-src-master--.* worse than pg-src-16--.* by 53.4 percents (234.20 > 152.64): pg_tpcds.query15
> > Please look at the full html report attached in case you're interested.
> >
> > (I used my pg-mark tool to measure/analyze performance, but I believe the
> > same results can be seen without it.)
>
> Will investigate, but if it's easy for you to rerun, does it help if
> you increase Linux readahead, eg blockdev --setra setting?

Andres happened to have TPC-DS handy, and reproduced that regression
in q15.  We tried some stuff and figured out that it requires
parallel_leader_participation=on, ie that this looks like some kind of
parallel fairness and/or timing problem.  It seems to be a question of
which worker finishes up processing matching rows, and the leader gets
a ~10ms head start but may be a little more greedy with the new
streaming code.  He tried reordering the table contents and then saw
17 beat 16.  So for q15, initial indications are that this isn't a
fundamental regression, it's just a test that is sensitive to some
arbitrary conditions.

I'll try to figure out some more details about that, ie is it being
too greedy on small-ish tables, and generally I do wonder about the
interactions between the heuristics and batching working at different
levels (OS, seq scan, read stream, hence my earlier ra question which
is likely a red herring) and how there might be unintended
consequences/interference patterns, but this particular case seems
more data dependent.



Re: Streaming read-ready sequential scan code

From
Thomas Munro
Date:
On Sat, May 18, 2024 at 11:30 AM Thomas Munro <thomas.munro@gmail.com> wrote:
> Andres happened to have TPC-DS handy, and reproduced that regression
> in q15.  We tried some stuff and figured out that it requires
> parallel_leader_participation=on, ie that this looks like some kind of
> parallel fairness and/or timing problem.  It seems to be a question of
> which worker finishes up processing matching rows, and the leader gets
> a ~10ms head start but may be a little more greedy with the new
> streaming code.  He tried reordering the table contents and then saw
> 17 beat 16.  So for q15, initial indications are that this isn't a
> fundamental regression, it's just a test that is sensitive to some
> arbitrary conditions.
>
> I'll try to figure out some more details about that, ie is it being
> too greedy on small-ish tables,

After more debugging, we learned a lot more things...

1.  That query produces spectacularly bad estimates, so we finish up
having to increase the number of buckets in a parallel hash join many
times.  That is quite interesting, but unrelated to new code.
2.  Parallel hash join is quite slow at negotiating an increase in the
number of hash buckets, if all of the input tuples are being filtered
out by quals, because of the choice of where workers check for
PHJ_GROWTH_NEED_MORE_BUCKETS.  That could be improved quite easily I
think.  I have put that on my todo list 'cause that's also my code,
but it's not a new issue, it's just one that is now highlighted...
3.  This bit of read_stream.c is exacerbating unfairness in the
underlying scan, so that 1 and 2 come together and produce a nasty
slowdown, which goes away if you change it like so:

-       BlockNumber blocknums[16];
+       BlockNumber blocknums[1];

I will follow up after some more study.



Re: Streaming read-ready sequential scan code

From
Alexander Lakhin
Date:
Hello Thomas,

18.05.2024 07:47, Thomas Munro wrote:
> After more debugging, we learned a lot more things...
>
> 1.  That query produces spectacularly bad estimates, so we finish up
> having to increase the number of buckets in a parallel hash join many
> times.  That is quite interesting, but unrelated to new code.
> 2.  Parallel hash join is quite slow at negotiating an increase in the
> number of hash buckets, if all of the input tuples are being filtered
> out by quals, because of the choice of where workers check for
> PHJ_GROWTH_NEED_MORE_BUCKETS.  That could be improved quite easily I
> think.  I have put that on my todo list 'cause that's also my code,
> but it's not a new issue, it's just one that is now highlighted...
> 3.  This bit of read_stream.c is exacerbating unfairness in the
> underlying scan, so that 1 and 2 come together and produce a nasty
> slowdown, which goes away if you change it like so:
>
> -       BlockNumber blocknums[16];
> +       BlockNumber blocknums[1];
>
> I will follow up after some more study.

Thank you for the information!

Unfortunately, I can't see significant differences in my environment with
parallel_leader_participation=off.

With blocknums[1], the timing changes, but the effect is not consistent.
10 query15 executions in a row, b7b0f3f27:
277.932 ms
281.805 ms
278.335 ms
281.565 ms
284.167 ms
283.171 ms
281.165 ms
281.615 ms
285.394 ms
277.301 ms

b7b0f3f27~1:
159.789 ms
165.407 ms
160.893 ms
159.343 ms
160.936 ms
161.577 ms
161.637 ms
163.421 ms
163.143 ms
167.109 ms

b7b0f3f27 + blocknums[1]:
164.133 ms
280.920 ms
160.748 ms
163.182 ms
161.709 ms
161.998 ms
161.239 ms
276.256 ms
161.601 ms
160.384 ms

I placed PGDATA on tmpfs to rule out any blockdev specifics (increasing
blockdev ra from 256 to 4096 didn't help me with PGDATA on NVMe either).

Best regards,
Alexander



Re: Streaming read-ready sequential scan code

From
Thomas Munro
Date:
On Sun, May 19, 2024 at 7:00 AM Alexander Lakhin <exclusion@gmail.com> wrote:
> With blocknums[1], the timing changes, but the effect is not consistent.
> 10 query15 executions in a row, b7b0f3f27:
> 277.932 ms
> 281.805 ms
> 278.335 ms
> 281.565 ms
> 284.167 ms
> 283.171 ms
> 281.165 ms
> 281.615 ms
> 285.394 ms
> 277.301 ms

The bad time 10/10.

> b7b0f3f27~1:
> 159.789 ms
> 165.407 ms
> 160.893 ms
> 159.343 ms
> 160.936 ms
> 161.577 ms
> 161.637 ms
> 163.421 ms
> 163.143 ms
> 167.109 ms

The good time 10/10.

> b7b0f3f27 + blocknums[1]:
> 164.133 ms
> 280.920 ms
> 160.748 ms
> 163.182 ms
> 161.709 ms
> 161.998 ms
> 161.239 ms
> 276.256 ms
> 161.601 ms
> 160.384 ms

The good time 8/10, the bad time 2/10.

Thanks for checking!  I bet all branches can show that flip/flop
instability in these adverse conditions, depending on random
scheduling details.  I will start a new thread with a patch for the
root cause of that, ie problem #2 (this will need back-patching), and
post a fix for #3 (v17 blocknums[N] tweak affecting
fairness/likelihood, which was probably basically a bit of ill-advised
premature optimisation) here in a few days.



Re: Streaming read-ready sequential scan code

From
Melanie Plageman
Date:
Thank you to all of you for looking into this.

On Sat, May 18, 2024 at 12:47 AM Thomas Munro <thomas.munro@gmail.com> wrote:
>
> On Sat, May 18, 2024 at 11:30 AM Thomas Munro <thomas.munro@gmail.com> wrote:
> > Andres happened to have TPC-DS handy, and reproduced that regression
> > in q15.  We tried some stuff and figured out that it requires
> > parallel_leader_participation=on, ie that this looks like some kind of
> > parallel fairness and/or timing problem.  It seems to be a question of
> > which worker finishes up processing matching rows, and the leader gets
> > a ~10ms head start but may be a little more greedy with the new
> > streaming code.  He tried reordering the table contents and then saw
> > 17 beat 16.  So for q15, initial indications are that this isn't a
> > fundamental regression, it's just a test that is sensitive to some
> > arbitrary conditions.
> >
> > I'll try to figure out some more details about that, ie is it being
> > too greedy on small-ish tables,
>
> After more debugging, we learned a lot more things...
>
> 1.  That query produces spectacularly bad estimates, so we finish up
> having to increase the number of buckets in a parallel hash join many
> times.  That is quite interesting, but unrelated to new code.
> 2.  Parallel hash join is quite slow at negotiating an increase in the
> number of hash buckets, if all of the input tuples are being filtered
> out by quals, because of the choice of where workers check for
> PHJ_GROWTH_NEED_MORE_BUCKETS.  That could be improved quite easily I
> think.  I have put that on my todo list 'cause that's also my code,
> but it's not a new issue, it's just one that is now highlighted...
> 3.  This bit of read_stream.c is exacerbating unfairness in the
> underlying scan, so that 1 and 2 come together and produce a nasty
> slowdown, which goes away if you change it like so:
>
> -       BlockNumber blocknums[16];
> +       BlockNumber blocknums[1];
>
> I will follow up after some more study.

So, if you are seeing the slow-down mostly go away by reducing the
blocknums array size, does the regression only appear when the scan
data is fully in shared buffers? Or is this blocknums' other use
(dealing with short reads)?

Is your theory that one worker ends up reading 16 blocks that should
have been distributed across multiple workers?

- Melanie



Re: Streaming read-ready sequential scan code

From
Thomas Munro
Date:
On Tue, May 21, 2024 at 9:11 AM Melanie Plageman
<melanieplageman@gmail.com> wrote:
> So, if you are seeing the slow-down mostly go away by reducing the
> blocknums array size, does the regression only appear when the scan
> data is fully in shared buffers? Or is this blocknums' other use
> (dealing with short reads)?

That must be true (the blocknums array is normally only "filled" in
the "fast path", where all buffers are found in cache).

> Is your theory that one worker ends up reading 16 blocks that should
> have been distributed across multiple workers?

Yes, it just jiggles the odds around a bit, introducing a bit of extra
unfairness by calling the callback in a tighter loop to build a little
batch, revealing a pre-existing problem.

The mistake in PHJ (problem #2 above) is that, once a worker decides
it would like all workers to stop inserting so it can increase the
number of buckets, it sets a flag to ask them to do that, and waits
for them to see it, but if there is a worker filtering all tuples out,
it never checks the "growth" flag.  So it scans all the way to the end
while the other guy waits.  Normally it checks that flag when it is
time to allocate a new chunk of memory, which seemed to make sense to
me at the time: if we've hit the needs-more-buckets (or
needs-more-batches) logic, then surely workers are inserting tuples
and will soon allocate a new chunk!  But, of course, here is the edge
case where that isn't true: we had bad estimates, so the hash table was
too small (problem #1), we got lots of tuples clustered over a few heap
pages and decided to expand the hash table, but right at that moment,
matching tuples ran out so somebody had to finish the whole scan
without ever checking the flag (problem #2), and that someone happened
to have all the rest of the pages because we made the lookahead a bit
less fair (problem #3).  Nice confluence of problems.  I expect #2 and
#3 to be easy to fix, and I didn't look at the estimation problem #1
at all (perhaps a stats puzzle designed by the TPC to trip us up?).
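
In pseudo-C, the shape of problem #2 is roughly the following (all
names invented for illustration; the real logic lives in nodeHash.c):

    while ((tuple = next_tuple_from_scan()) != NULL)
    {
        if (!tuple_matches_quals(tuple))
            continue;       /* filtered out: no insertion, no allocation,
                             * so the growth flag below is never seen */

        if (!tuple_fits_in_current_chunk(tuple))
        {
            /* the only point where this worker notices the request */
            if (growth_flag_is_set())
                help_resize_buckets_and_wait();
            allocate_new_chunk();
        }
        insert_into_hash_table(tuple);
    }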



Re: Streaming read-ready sequential scan code

From
Alexander Lakhin
Date:
Hello Thomas,

27.08.2024 09:52, Thomas Munro wrote:
> Here's a really simple way to see the new unfairness at the end of a
> parallel scan:
>
> drop table if exists t;
> create table t (i int);
> insert into t select generate_series(1, 100000);
> alter table t set (parallel_workers = 2);
> set parallel_setup_cost = 0;
> set parallel_leader_participation = off;
> explain (analyze, buffers, verbose) select count(*) from t;
>
> On my machine, unpatched master shows:
>
>                       Worker 0:  actual time=0.036..12.452 rows=51076 loops=1
>                         Buffers: shared hit=226
>                       Worker 1:  actual time=0.037..12.003 rows=48924 loops=1
>                         Buffers: shared hit=217
>
> The attached patch, which I'd like to push, is effectively what
> Alexander tested (blocknums[16] -> blocknums[1]).  There's no point in
> using an array of size 1, so I've turned it into a simple variable and
> deleted the relevant comments.  My machine shows:
>
>                       Worker 0:  actual time=0.038..12.115 rows=49946 loops=1
>                         Buffers: shared hit=221
>                       Worker 1:  actual time=0.038..12.109 rows=50054 loops=1
>                         Buffers: shared hit=222
>
> That difference may not seem huge, but other pre-existing things are
> going pathologically wrong in the reported query that magnify it (see
> my earlier analysis).  It's an interesting problem that will require
> more study (my earlier analysis missed a detail that I'll write about
> separately), but it doesn't seem to be new or have easy fixes, so that
> will have to be for later work.

I've tried your query and could not get stable results, unfortunately.
The following script:
rm -rf "$PGDATA"; initdb -D "$PGDATA" >initdb.log 2>&1

pg_ctl -s -l server.log start

cat << EOF | psql | grep 'Parallel Seq Scan' -A10 | grep 'Worker ' -A1
create table t (i int);
insert into t select generate_series(1, 100000);
alter table t set (parallel_workers = 2);
set parallel_setup_cost = 0;
set parallel_leader_participation = off;
explain (analyze, buffers, verbose) select count(*) from t;
EOF

pg_ctl -s stop

gives me unstable numbers on unpatched master:
                      Worker 0:  actual time=0.024..5.814 rows=51076 loops=1
                        Buffers: shared hit=226
                      Worker 1:  actual time=0.023..5.614 rows=48924 loops=1
                        Buffers: shared hit=217
---
                      Worker 0:  actual time=0.027..5.130 rows=36612 loops=1
                        Buffers: shared hit=162
                      Worker 1:  actual time=0.013..5.605 rows=63388 loops=1
                        Buffers: shared hit=281
---
                      Worker 0:  actual time=0.025..5.447 rows=47460 loops=1
                        Buffers: shared hit=210
                      Worker 1:  actual time=0.019..5.688 rows=52540 loops=1
                        Buffers: shared hit=233

and also with the patch applied:
                      Worker 0:  actual time=0.012..4.486 rows=55478 loops=1
                        Buffers: shared hit=246
                      Worker 1:  actual time=0.014..4.430 rows=44522 loops=1
                        Buffers: shared hit=197
---
                      Worker 0:  actual time=0.013..4.269 rows=55822 loops=1
                        Buffers: shared hit=247
                      Worker 1:  actual time=0.017..4.238 rows=44178 loops=1
                        Buffers: shared hit=196
---
                      Worker 0:  actual time=0.014..4.974 rows=50624 loops=1
                        Buffers: shared hit=224
                      Worker 1:  actual time=0.016..4.932 rows=49376 loops=1
                        Buffers: shared hit=219
---
                      Worker 0:  actual time=0.012..5.459 rows=65648 loops=1
                        Buffers: shared hit=291
                      Worker 1:  actual time=0.022..5.109 rows=34352 loops=1
                        Buffers: shared hit=152

Please correct me if I'm doing something wrong.

Best regards,
Alexander



Re: Streaming read-ready sequential scan code

From
Thomas Munro
Date:
On Wed, Aug 28, 2024 at 1:00 AM Alexander Lakhin <exclusion@gmail.com> wrote:
> gives me unstable numbers on unpatched master:
>                         Buffers: shared hit=226
>                         Buffers: shared hit=217

>                         Buffers: shared hit=162
>                         Buffers: shared hit=281

>                         Buffers: shared hit=210
>                         Buffers: shared hit=233

> and also with the patch applied:
>                         Buffers: shared hit=246
>                         Buffers: shared hit=197

>                         Buffers: shared hit=247
>                         Buffers: shared hit=196

>                         Buffers: shared hit=224
>                         Buffers: shared hit=219

>                         Buffers: shared hit=291
>                         Buffers: shared hit=152
>
> Please correct me if I'm doing something wrong.

Huh.  I can reproduce what I showed with pretty low variance, on my
FreeBSD, Linux and macOS systems.  I included
parallel_leader_participation=off so that the workers would hopefully
start as closely together in time as possible, and hopefully allow
only a block or so of variation in the outcome.  If I do:

create or replace function f(i int) returns int language plpgsql
parallel safe as $$
begin
  raise notice '% pid %', clock_timestamp(), pg_backend_pid();
  return i;
end;
$$;

then

postgres=# explain (analyze) select f(i) from t limit 1;
NOTICE:  2024-08-28 16:41:32.845747+12 pid 47019
NOTICE:  2024-08-28 16:41:32.845746+12 pid 47018

shows start times differ by only a few microseconds at most, often 0.
I wonder if your system is more variable at the beginning of execution?
Maybe you're on a multi-socket system and forking/startup times vary
in some way I can't see on small systems or something like that?



Re: Streaming read-ready sequential scan code

From
Alexander Lakhin
Date:
28.08.2024 07:45, Thomas Munro wrote:
> Huh.  I can reproduce what I showed with pretty low variance, on my
> FreeBSD, Linux and macOS systems.  I included
> parallel_leader_participation=off so that the workers would hopefully
> start as closely together in time as possible, and hopefully allow
> only a block or so of variation in the outcome.  If I do:
>
> create or replace function f(i int) returns int language plpgsql
> parallel safe as $$
> begin
>    raise notice '% pid %', clock_timestamp(), pg_backend_pid();
>    return i;
> end;
> $$;
>
> then
>
> postgres=# explain (analyze) select f(i) from t limit 1;
> NOTICE:  2024-08-28 16:41:32.845747+12 pid 47019
> NOTICE:  2024-08-28 16:41:32.845746+12 pid 47018
>
> shows start times differ by only a few microseconds at most, often 0.
> I wonder if your system is more variable at the beginning of execution?
> Maybe you're on a multi-socket system and forking/startup times vary
> in some way I can't see on small systems or something like that?

I use a single-socket system with AMD Ryzen 5900X, running Ubuntu 20.04.
With no active background tasks running, I'm seeing:
NOTICE:  2024-08-28 08:17:36.917162+00 pid 320103
NOTICE:  2024-08-28 08:17:36.917163+00 pid 320102

NOTICE:  2024-08-28 08:17:40.592333+00 pid 320143
NOTICE:  2024-08-28 08:17:40.592645+00 pid 320144

With log_min_messages = DEBUG3, I get:
NOTICE:  2024-08-28 08:41:59.309364+00 pid 338263
NOTICE:  2024-08-28 08:41:59.310079+00 pid 338264

And the following messages in the server.log:
2024-08-28 08:41:59.304 UTC [338251] DEBUG:  starting background worker process "parallel worker for PID 338262"
2024-08-28 08:41:59.304 UTC [338251] DEBUG:  starting background worker process "parallel worker for PID 338262"
2024-08-28 08:41:59.305 UTC [338263] DEBUG:  InitPostgres
2024-08-28 08:41:59.305 UTC [338264] DEBUG:  InitPostgres
2024-08-28 08:41:59.309 UTC [338263] NOTICE:  2024-08-28 08:41:59.309364+00 pid 338263
2024-08-28 08:41:59.309 UTC [338263] CONTEXT:  PL/pgSQL function f(integer) line 3 at RAISE
2024-08-28 08:41:59.309 UTC [338262] NOTICE:  2024-08-28 08:41:59.309364+00 pid 338263
2024-08-28 08:41:59.309 UTC [338262] CONTEXT:  PL/pgSQL function f(integer) line 3 at RAISE
     parallel worker
2024-08-28 08:41:59.309 UTC [338263] DEBUG:  shmem_exit(0): 5 before_shmem_exit callbacks to make
2024-08-28 08:41:59.309 UTC [338263] DEBUG:  shmem_exit(0): 6 on_shmem_exit callbacks to make
2024-08-28 08:41:59.309 UTC [338263] DEBUG:  proc_exit(0): 1 callbacks to make
2024-08-28 08:41:59.309 UTC [338263] DEBUG:  exit(0)
2024-08-28 08:41:59.309 UTC [338263] DEBUG:  shmem_exit(-1): 0 before_shmem_exit callbacks to make
2024-08-28 08:41:59.309 UTC [338263] DEBUG:  shmem_exit(-1): 0 on_shmem_exit callbacks to make
2024-08-28 08:41:59.309 UTC [338263] DEBUG:  proc_exit(-1): 0 callbacks to make
2024-08-28 08:41:59.310 UTC [338264] NOTICE:  2024-08-28 08:41:59.310079+00 pid 338264
2024-08-28 08:41:59.310 UTC [338264] CONTEXT:  PL/pgSQL function f(integer) line 3 at RAISE

It looks like the two parallel workers were started simultaneously, but
then the second one lagged behind...

Best regards,
Alexander



Re: Streaming read-ready sequential scan code

From
Thomas Munro
Date:
On Wed, Aug 28, 2024 at 9:00 PM Alexander Lakhin <exclusion@gmail.com> wrote:
> 2024-08-28 08:41:59.304 UTC [338251] DEBUG:  starting background worker process "parallel worker for PID 338262"
> 2024-08-28 08:41:59.304 UTC [338251] DEBUG:  starting background worker process "parallel worker for PID 338262"

> 2024-08-28 08:41:59.305 UTC [338263] DEBUG:  InitPostgres
> 2024-08-28 08:41:59.305 UTC [338264] DEBUG:  InitPostgres

> 2024-08-28 08:41:59.309 UTC [338263] NOTICE:  2024-08-28 08:41:59.309364+00 pid 338263
> 2024-08-28 08:41:59.310 UTC [338264] NOTICE:  2024-08-28 08:41:59.310079+00 pid 338264

> It looks like the two parallel workers were started simultaneously, but
> then the second one lagged behind...

Yeah.  That's quite interesting, and must destabilise that
simple-minded demo.  I'm curious to know exactly what contention is
causing that (about 3/4 of a millisecond that I don't see and now I
want to know what it's waiting for), but it's a very crude test
lacking timer resolution in the earlier messages, and it's an
unrelated topic and a distraction.  Perhaps it explains why you saw
two different behaviours in Q15 with the patch and I didn't, though.
Really it shouldn't be so sensitive to such variations, it's obviously
a terrible plan, and TPC-DS needs a planner hacker mega-brain applied
to it; I'm going to try to nerd-snipe one...



Re: Streaming read-ready sequential scan code

From
Alexander Lakhin
Date:
Hello Thomas,

29.08.2024 01:16, Thomas Munro wrote:
> Yeah.  That's quite interesting, and must destabilise that
> simple-minded demo.  I'm curious to know exactly what contention is
> causing that (about 3/4 of a millisecond that I don't see and now I
> want to know what it's waiting for), but it's a very crude test
> lacking timer resolution in the earlier messages, and it's an
> unrelated topic and a distraction.  Perhaps it explains why you saw
> two different behaviours in Q15 with the patch and I didn't, though.
> Really it shouldn't be so sensitive to such variations, it's obviously
> a terrible plan, and TPC-DS needs a planner hacker mega-brain applied
> to it; I'm going to try to nerd-snipe one...

I looked at two perf profiles of such out-of-sync processes and found no
extra calls whatsoever in the slow one; it just has the number of perf
samples increased proportionally. It made me suspect CPU frequency
scaling... Indeed, with the "performance" governor set and the boost mode
disabled, I'm now seeing much more stable numbers (I do this tuning before
running performance tests, but I had forgotten about that when I ran
your test, my bad).

I'm sorry for the noise and the distraction.

Still, now I can confirm your results. Without the patch, two parallel
workers gave "Buffers: shared hit=217 / Buffers: shared hit=226" 10 times
out of 10. And with the patch, I've got
"shared hit=219 / shared hit=224" 3 times,
"shared hit=220 / shared hit=223" 4 times,
"shared hit=221 / shared hit=222" 3 times of 10.

On b7b0f3f27~1, my results are:
"shared hit=219 / shared hit=224": 2
"shared hit=220 / shared hit=223": 3
"shared hit=221 / shared hit=222": 4
"shared hit=218 / shared hit=225": 1

Best regards,
Alexander



Re: Streaming read-ready sequential scan code

From
Thomas Munro
Date:
On Fri, Aug 30, 2024 at 1:00 AM Alexander Lakhin <exclusion@gmail.com> wrote:
> I looked at two perf profiles of such out-of-sync processes and found no
> extra calls whatsoever in the slow one; it just has the number of perf
> samples increased proportionally. It made me suspect CPU frequency
> scaling... Indeed, with the "performance" governor set and the boost mode
> disabled, I'm now seeing much more stable numbers (I do this tuning before
> running performance tests, but I had forgotten about that when I ran
> your test, my bad).

Aha, mystery solved.

I have pushed the fix.  Thanks!