Thread: Streaming read-ready sequential scan code
Hi,

Last year, David and I worked on a round of refactoring for heapgettup() and heapgettup_pagemode() [1]. Now that the streaming read API has been proposed [2], there is a bit more refactoring that can be done on master to prepare sequential scan to support streaming reads.

Patches 0001 and 0002 in the attached patchset do this new round of refactoring. 0003 is the remainder of the streaming read API that is not yet in master. 0004 is the sequential scan streaming read user.

The primary change needed to drop in streaming read support is that heapgettup() and heapgettup_pagemode() must treat "no more valid buffers", rather than "no more valid BlockNumbers", as the signal that the relation has been entirely processed. Naturally, streaming reads prefetch ahead of the block currently being processed by the scan, so all blocks should have been requested long before all blocks have been processed.

To make this change, I split up heapgetpage() into two functions -- one responsible for getting blocks into buffers and the other for processing a page (pruning, checking tuple visibility, etc).

As a consequence, I had to change the other caller of heapgetpage() (sample scans). Since I was doing this anyway, I made a few changes there. It is arguable that those changes could be split up differently between 0001 and 0004. However, I wanted 0004 to be *only* the sequential scan streaming read user code.

There is an outstanding question about where to allocate the PgStreamingRead object for sequential scans (see TODO in 0004). However, I thought I would keep this thread focused on 0001 and 0002.

Though logically the performance with 0001 and 0002 should be the same as master (no new non-inline function calls, no additional looping), I've done a bit of profiling anyway. I created a large multi-GB table, read it all into shared buffers (disabling the large sequential scan bulkread optimization), and did a sequential SELECT count(*) from the table. From the profiles below, you'll notice that master and the patch are basically the same. Actual percentages vary from run to run. Execution time is the same.

patch
  15.49%  postgres  postgres  [.] ExecInterpExpr
  11.03%  postgres  postgres  [.] heapgettup_pagemode
  10.85%  postgres  postgres  [.] ExecStoreBufferHeapTuple
   9.14%  postgres  postgres  [.] heap_getnextslot
   8.39%  postgres  postgres  [.] heapbuildvis
   6.47%  postgres  postgres  [.] SeqNext

master
  14.16%  postgres  postgres  [.] ExecInterpExpr
  11.54%  postgres  postgres  [.] heapgettup_pagemode
  10.63%  postgres  postgres  [.] ExecStoreBufferHeapTuple
  10.22%  postgres  postgres  [.] heap_getnextslot
   8.53%  postgres  postgres  [.] heapgetpage
   5.35%  postgres  postgres  [.] SeqNext

- Melanie

[1] https://www.postgresql.org/message-id/flat/CAAKRu_YSOnhKsDyFcqJsKtBSrd32DP-jjXmv7hL0BPD-z0TGXQ%40mail.gmail.com
[2] https://www.postgresql.org/message-id/flat/CA%2BhUKGJkOiOCa%2Bmag4BF%2BzHo7qo%3Do9CFheB8%3Dg6uT5TUm2gkvA%40mail.gmail.com
Attachment
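For readers who want a concrete picture of the heapgetpage() split described above, a rough sketch of the buffer-fetching half might look like the following. This is illustrative only, not the actual patch: the function name heapfetchbuf() matches the symbol that later appears in the thread's patches, but the body here is a simplified guess at its shape.

static void
heapfetchbuf(HeapScanDesc scan, BlockNumber block)
{
	/* release the previously pinned page, if any */
	if (BufferIsValid(scan->rs_cbuf))
		ReleaseBuffer(scan->rs_cbuf);

	/* read and pin the requested block of the scan's relation */
	scan->rs_cbuf = ReadBufferExtended(scan->rs_base.rs_rd, MAIN_FORKNUM,
									   block, RBM_NORMAL, scan->rs_strategy);
	scan->rs_cblock = block;
}

The other half (heapbuildvis() in the profiles above, later renamed heap_prepare_pagescan() downthread) would then take the pinned buffer and do the pruning and visibility checking that heapgetpage() used to do.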
On Tue, 30 Jan 2024 at 10:17, Melanie Plageman <melanieplageman@gmail.com> wrote: > Though logically the performance with 0001 and 0002 should be the same > as master (no new non-inline function calls, no additional looping), > I've done a bit of profiling anyway. I created a large multi-GB table, > read it all into shared buffers (disabling the large sequential scan > bulkread optimization), and did a sequential SELECT count(*) from the > table. From the profiles below, you'll notice that master and the > patch are basically the same. Actual percentages vary from run-to-run. > Execution time is the same. Can you also run a test on a Seqscan with a filter that filters out all tuples? There's less overhead in other parts of the executor with such a query. David
On Mon, Jan 29, 2024 at 4:24 PM David Rowley <dgrowleyml@gmail.com> wrote:
>
> On Tue, 30 Jan 2024 at 10:17, Melanie Plageman
> <melanieplageman@gmail.com> wrote:
> > Though logically the performance with 0001 and 0002 should be the same
> > as master (no new non-inline function calls, no additional looping),
> > I've done a bit of profiling anyway. I created a large multi-GB table,
> > read it all into shared buffers (disabling the large sequential scan
> > bulkread optimization), and did a sequential SELECT count(*) from the
> > table. From the profiles below, you'll notice that master and the
> > patch are basically the same. Actual percentages vary from run-to-run.
> > Execution time is the same.
>
> Can you also run a test on a Seqscan with a filter that filters out
> all tuples? There's less overhead in other parts of the executor with
> such a query.

Yes, of course. Thank you so much for taking a look!

While I was at it, I changed the table schema to be entirely composed of INT type columns and regenerated the data. Note that, both in this example and my previous example, I ensured that the table was vacuumed beforehand (and autovacuum disabled for the table) so there wasn't any on-access pruning happening (heapgetpage() does that in pagemode).

This is the schema:

CREATE TABLE foo(id INT, a INT, b INT, c INT, d INT, e INT, f INT, g INT) with (autovacuum_enabled=false);

I added 46000000 rows to the table, making it 2.6 GB. Shared buffers is double that. Before profiling, I did a SELECT * from the table with the large sequential scan bulkread optimization disabled. Then I vacuumed the table. Finally, I turned up parallel_setup_cost high enough to disable query parallelism.

The query I profiled was:

SELECT * FROM foo WHERE id = 0;

With the data I generated, 0 rows match that condition. Profiles below. Execution time essentially the same.

patch:
  17.08%  postgres  postgres  [.] ExecInterpExpr
  11.17%  postgres  postgres  [.] tts_buffer_heap_getsomeattrs
  10.64%  postgres  postgres  [.] ExecStoreBufferHeapTuple
   9.82%  postgres  postgres  [.] heap_getnextslot
   9.13%  postgres  postgres  [.] heapgettup_pagemode
   8.98%  postgres  postgres  [.] heapbuildvis
   5.40%  postgres  postgres  [.] HeapCheckForSerializableConflictOut
   5.16%  postgres  postgres  [.] SeqNext

master:
  17.89%  postgres  postgres  [.] ExecInterpExpr
  12.28%  postgres  postgres  [.] tts_buffer_heap_getsomeattrs
  10.54%  postgres  postgres  [.] ExecStoreBufferHeapTuple
  10.11%  postgres  postgres  [.] heapgettup_pagemode
   8.52%  postgres  postgres  [.] heapgetpage
   8.28%  postgres  postgres  [.] heap_getnextslot
   5.00%  postgres  postgres  [.] HeapCheckForSerializableConflictOut
   4.71%  postgres  postgres  [.] SeqNext

- Melanie
On Mon, Jan 29, 2024 at 4:17 PM Melanie Plageman <melanieplageman@gmail.com> wrote:
>
> There is an outstanding question about where to allocate the
> PgStreamingRead object for sequential scans

I've written three alternative implementations of the actual streaming read user for sequential scan which handle the question of where to allocate the streaming read object and how to handle changing scan direction in different ways.

Option A) https://github.com/melanieplageman/postgres/tree/seqscan_pgsr_initscan_allocation
- Allocates the streaming read object in initscan(). Since we do not know the scan direction at this time, if the scan ends up not being a forwards scan, the streaming read object must later be freed -- so this will sometimes allocate a streaming read object it never uses.
- Only supports ForwardScanDirection and once the scan direction changes, streaming is never supported again -- even if we return to ForwardScanDirection
- Must maintain a "fallback" codepath that does not use the streaming read API

Option B) https://github.com/melanieplageman/postgres/tree/seqscan_pgsr_heapgettup_alloc_forward_only
- Allocates the streaming read object in heapgettup[_pagemode]() when it has not been previously allocated. To do this it has to record and switch into a different memory context than the per-tuple context. It only allocates the streaming read object if it is a forwards scan. It frees the streaming read object if the scan direction is later changed.
- Only supports ForwardScanDirection and once the scan direction changes, streaming is never supported again -- even if we return to ForwardScanDirection
- Must maintain a "fallback" codepath that does not use the streaming read API

Option C) https://github.com/melanieplageman/postgres/tree/seqscan_pgsr_all_dir_stream
- Allocates the streaming read object in heapgettup[_pagemode]() when it has not been previously allocated. To do this it has to record and switch into a different memory context than the per-tuple context.
- All scan directions support streaming. To do this, the scan direction has to be tracked and we must check if the direction has changed on every heapgettup[_pagemode]() invocation to avoid returning wrong results.
- No "fallback" codepath as all heap sequential scans will use the streaming read API

As you can see, each option has pros and cons. I'm interested in what others think about which we should choose.

- Melanie
On Tue, Feb 20, 2024 at 4:35 AM Melanie Plageman <melanieplageman@gmail.com> wrote: > I've written three alternative implementations of the actual streaming > read user for sequential scan which handle the question of where to > allocate the streaming read object and how to handle changing scan > direction in different ways. It's weird to me that the prospect of changing the scan direction causes such complexity. I mean, why doesn't a streaming read object have a forget_all_my_previous_requests() method or somesuch? -- Robert Haas EDB: http://www.enterprisedb.com
On Tue, Feb 20, 2024 at 6:13 AM Robert Haas <robertmhaas@gmail.com> wrote: > > On Tue, Feb 20, 2024 at 4:35 AM Melanie Plageman > <melanieplageman@gmail.com> wrote: > > I've written three alternative implementations of the actual streaming > > read user for sequential scan which handle the question of where to > > allocate the streaming read object and how to handle changing scan > > direction in different ways. > > It's weird to me that the prospect of changing the scan direction > causes such complexity. I mean, why doesn't a streaming read object > have a forget_all_my_previous_requests() method or somesuch? Basically, that is what pg_streaming_read_free() does. It goes through and releases the buffers it had pinned and frees any per-buffer data allocated. The complexity with the sequential scan streaming read user and scan direction is just that it has to detect when the scan direction changes and do the releasing/freeing and reallocation. The scan direction is passed to heapgettup[_pagemode](), so this is something that can change on a tuple-to-tuple basis. It is less that doing this is complicated and more that it is annoying and distracting to have to check for and handle a very unimportant and uncommon case in the main path of the common case. - Melanie
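To make the shape of that check concrete, the direction-change handling being described might look roughly like this in heapgettup[_pagemode]() (a sketch only; "rs_pgsr" and "rs_dir" are made-up field names for the stream pointer and the last-seen direction, not names from the actual patches):

	/*
	 * If the scan direction changed since the last call, throw away the
	 * stream's read-ahead state.  pg_streaming_read_free() unpins the
	 * buffers it holds and frees any per-buffer data.
	 */
	if (scan->rs_pgsr != NULL && dir != scan->rs_dir)
	{
		pg_streaming_read_free(scan->rs_pgsr);
		scan->rs_pgsr = NULL;	/* re-allocated lazily for the new direction */
	}
	scan->rs_dir = dir;

The annoyance Melanie describes is that this test has to run on a hot path even though direction changes are rare.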
On Mon, Feb 19, 2024 at 6:05 PM Melanie Plageman <melanieplageman@gmail.com> wrote: > > On Mon, Jan 29, 2024 at 4:17 PM Melanie Plageman > <melanieplageman@gmail.com> wrote: > > > > There is an outstanding question about where to allocate the > > PgStreamingRead object for sequential scans > > I've written three alternative implementations of the actual streaming > read user for sequential scan which handle the question of where to > allocate the streaming read object and how to handle changing scan > direction in different ways. > > Option A) https://github.com/melanieplageman/postgres/tree/seqscan_pgsr_initscan_allocation > - Allocates the streaming read object in initscan(). Since we do not > know the scan direction at this time, if the scan ends up not being a > forwards scan, the streaming read object must later be freed -- so > this will sometimes allocate a streaming read object it never uses. > - Only supports ForwardScanDirection and once the scan direction > changes, streaming is never supported again -- even if we return to > ForwardScanDirection > - Must maintain a "fallback" codepath that does not use the streaming read API Attached is a version of this patch which implements a "reset" function for the streaming read API which should be cheaper than the full pg_streaming_read_free() on rescan. This can easily be ported to work on any of my proposed implementations (A/B/C). I implemented it on A as an example. - Melanie
Attachment
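Caller-side, the idea of the reset function is that a rescan can keep the allocation and merely discard the read-ahead state instead of freeing and re-allocating the whole object (hypothetical sketch; the function name pg_streaming_read_reset() and field name rs_pgsr are guesses based on the description above, not taken from the attached patch):

	/* On rescan: keep the streaming read object, forget all queued blocks. */
	if (scan->rs_pgsr != NULL)
		pg_streaming_read_reset(scan->rs_pgsr);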
On Mon, Feb 26, 2024 at 03:56:57PM -0500, Melanie Plageman wrote:
> On Mon, Feb 19, 2024 at 6:05 PM Melanie Plageman
> <melanieplageman@gmail.com> wrote:
> >
> > On Mon, Jan 29, 2024 at 4:17 PM Melanie Plageman
> > <melanieplageman@gmail.com> wrote:
> > >
> > > There is an outstanding question about where to allocate the
> > > PgStreamingRead object for sequential scans
> >
> > I've written three alternative implementations of the actual streaming
> > read user for sequential scan which handle the question of where to
> > allocate the streaming read object and how to handle changing scan
> > direction in different ways.
> >
> > Option A) https://github.com/melanieplageman/postgres/tree/seqscan_pgsr_initscan_allocation
> > - Allocates the streaming read object in initscan(). Since we do not
> > know the scan direction at this time, if the scan ends up not being a
> > forwards scan, the streaming read object must later be freed -- so
> > this will sometimes allocate a streaming read object it never uses.
> > - Only supports ForwardScanDirection and once the scan direction
> > changes, streaming is never supported again -- even if we return to
> > ForwardScanDirection
> > - Must maintain a "fallback" codepath that does not use the streaming read API
>
> Attached is a version of this patch which implements a "reset"
> function for the streaming read API which should be cheaper than the
> full pg_streaming_read_free() on rescan. This can easily be ported to
> work on any of my proposed implementations (A/B/C). I implemented it
> on A as an example.

Attached is the latest version of this patchset -- rebased in light of Thomas' updates to the streaming read API [1]. I have chosen the approach I think we should go with. It is a hybrid of my previously proposed approaches.

The streaming read is allocated in heap_beginscan() and then reset on rescan and when the scan direction changes. I only check if the scan direction changes when a new page is needed. This implementation means no fallback method is needed, so we can remove the non-streaming read code for heap sequential scans.

Because heapgettup() and heapgettup_pagemode() are also used for TID range scans, this patch also happens to implement streaming reads for TID range scans.

- Melanie

[1] https://www.postgresql.org/message-id/CA%2BhUKGJtLyxcAEvLhVUhgD4fMQkOu3PDaj8Qb9SR_UsmzgsBpQ%40mail.gmail.com
Attachment
On Wed, Feb 28, 2024 at 12:30 PM Melanie Plageman <melanieplageman@gmail.com> wrote: > > On Mon, Feb 26, 2024 at 03:56:57PM -0500, Melanie Plageman wrote: > > On Mon, Feb 19, 2024 at 6:05 PM Melanie Plageman > > <melanieplageman@gmail.com> wrote: > > > > > > On Mon, Jan 29, 2024 at 4:17 PM Melanie Plageman > > > <melanieplageman@gmail.com> wrote: > > > > > > > > There is an outstanding question about where to allocate the > > > > PgStreamingRead object for sequential scans > > > > > > I've written three alternative implementations of the actual streaming > > > read user for sequential scan which handle the question of where to > > > allocate the streaming read object and how to handle changing scan > > > direction in different ways. > > > > > > Option A) https://github.com/melanieplageman/postgres/tree/seqscan_pgsr_initscan_allocation > > > - Allocates the streaming read object in initscan(). Since we do not > > > know the scan direction at this time, if the scan ends up not being a > > > forwards scan, the streaming read object must later be freed -- so > > > this will sometimes allocate a streaming read object it never uses. > > > - Only supports ForwardScanDirection and once the scan direction > > > changes, streaming is never supported again -- even if we return to > > > ForwardScanDirection > > > - Must maintain a "fallback" codepath that does not use the streaming read API > > > > Attached is a version of this patch which implements a "reset" > > function for the streaming read API which should be cheaper than the > > full pg_streaming_read_free() on rescan. This can easily be ported to > > work on any of my proposed implementations (A/B/C). I implemented it > > on A as an example. > > Attached is the latest version of this patchset -- rebased in light of > Thomas' updatees to the streaming read API [1]. I have chosen the > approach I think we should go with. It is a hybrid of my previously > proposed approaches. While investigating some performance concerns, Andres pointed out that the members I add to HeapScanDescData in this patch push rs_cindex and rs_ntuples to another cacheline and introduce a 4-byte hole. Attached v4's HeapScanDescData is as well-packed as on master and its members are reordered so that rs_cindex and rs_ntuples are back on the second cacheline of the struct's data. - Melanie
Attachment
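For anyone unfamiliar with the layout issue being discussed: on a typical 64-bit platform the compiler pads each struct member up to its alignment, so member order determines both padding holes and which cache line (usually 64 bytes) a member lands on. An illustrative example only, not the actual HeapScanDescData definition:

typedef struct example
{
	uint32		a;			/* offset 0, 4 bytes */
	/* 4-byte padding hole here so the pointer below is 8-byte aligned */
	char	   *b;			/* offset 8, 8 bytes */
	/* ... members past byte 64 fall onto a second cache line ... */
} example;

Reordering members so that same-alignment fields sit together removes such holes and can pull frequently accessed members back onto an earlier cache line, which is what v4 does for rs_cindex and rs_ntuples.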
On Sat, Mar 02, 2024 at 06:07:48PM -0500, Melanie Plageman wrote: > On Wed, Feb 28, 2024 at 12:30 PM Melanie Plageman > <melanieplageman@gmail.com> wrote: > > > > On Mon, Feb 26, 2024 at 03:56:57PM -0500, Melanie Plageman wrote: > > > On Mon, Feb 19, 2024 at 6:05 PM Melanie Plageman > > > <melanieplageman@gmail.com> wrote: > > > > > > > > On Mon, Jan 29, 2024 at 4:17 PM Melanie Plageman > > > > <melanieplageman@gmail.com> wrote: > > > > > > > > > > There is an outstanding question about where to allocate the > > > > > PgStreamingRead object for sequential scans > > > > > > > > I've written three alternative implementations of the actual streaming > > > > read user for sequential scan which handle the question of where to > > > > allocate the streaming read object and how to handle changing scan > > > > direction in different ways. > > > > > > > > Option A) https://github.com/melanieplageman/postgres/tree/seqscan_pgsr_initscan_allocation > > > > - Allocates the streaming read object in initscan(). Since we do not > > > > know the scan direction at this time, if the scan ends up not being a > > > > forwards scan, the streaming read object must later be freed -- so > > > > this will sometimes allocate a streaming read object it never uses. > > > > - Only supports ForwardScanDirection and once the scan direction > > > > changes, streaming is never supported again -- even if we return to > > > > ForwardScanDirection > > > > - Must maintain a "fallback" codepath that does not use the streaming read API > > > > > > Attached is a version of this patch which implements a "reset" > > > function for the streaming read API which should be cheaper than the > > > full pg_streaming_read_free() on rescan. This can easily be ported to > > > work on any of my proposed implementations (A/B/C). I implemented it > > > on A as an example. > > > > Attached is the latest version of this patchset -- rebased in light of > > Thomas' updatees to the streaming read API [1]. I have chosen the > > approach I think we should go with. It is a hybrid of my previously > > proposed approaches. > > While investigating some performance concerns, Andres pointed out that > the members I add to HeapScanDescData in this patch push rs_cindex and > rs_ntuples to another cacheline and introduce a 4-byte hole. Attached > v4's HeapScanDescData is as well-packed as on master and its members > are reordered so that rs_cindex and rs_ntuples are back on the second > cacheline of the struct's data. I did some additional profiling and realized that dropping the unlikely() from the places we check rs_inited frequently was negatively impacting performance. v5 adds those back and also makes a few other very minor cleanups. Note that this patch set has a not yet released version of Thomas Munro's Streaming Read API with a new ramp-up logic which seems to fix a performance issue I saw with my test case when all of the sequential scan's blocks are in shared buffers. Once he sends the official new version, I will rebase this and point to his explanation in that thread. - Melanie
Attachment
On Fri, Mar 8, 2024 at 4:56 PM Melanie Plageman <melanieplageman@gmail.com> wrote: > > On Sat, Mar 02, 2024 at 06:07:48PM -0500, Melanie Plageman wrote: > > On Wed, Feb 28, 2024 at 12:30 PM Melanie Plageman > > <melanieplageman@gmail.com> wrote: > > > > > > On Mon, Feb 26, 2024 at 03:56:57PM -0500, Melanie Plageman wrote: > > > > On Mon, Feb 19, 2024 at 6:05 PM Melanie Plageman > > > > <melanieplageman@gmail.com> wrote: > > > > > > > > > > On Mon, Jan 29, 2024 at 4:17 PM Melanie Plageman > > > > > <melanieplageman@gmail.com> wrote: > > > > > > > > > > > > There is an outstanding question about where to allocate the > > > > > > PgStreamingRead object for sequential scans > > > > > > > > > > I've written three alternative implementations of the actual streaming > > > > > read user for sequential scan which handle the question of where to > > > > > allocate the streaming read object and how to handle changing scan > > > > > direction in different ways. > > > > > > > > > > Option A) https://github.com/melanieplageman/postgres/tree/seqscan_pgsr_initscan_allocation > > > > > - Allocates the streaming read object in initscan(). Since we do not > > > > > know the scan direction at this time, if the scan ends up not being a > > > > > forwards scan, the streaming read object must later be freed -- so > > > > > this will sometimes allocate a streaming read object it never uses. > > > > > - Only supports ForwardScanDirection and once the scan direction > > > > > changes, streaming is never supported again -- even if we return to > > > > > ForwardScanDirection > > > > > - Must maintain a "fallback" codepath that does not use the streaming read API > > > > > > > > Attached is a version of this patch which implements a "reset" > > > > function for the streaming read API which should be cheaper than the > > > > full pg_streaming_read_free() on rescan. This can easily be ported to > > > > work on any of my proposed implementations (A/B/C). I implemented it > > > > on A as an example. > > > > > > Attached is the latest version of this patchset -- rebased in light of > > > Thomas' updatees to the streaming read API [1]. I have chosen the > > > approach I think we should go with. It is a hybrid of my previously > > > proposed approaches. > > > > While investigating some performance concerns, Andres pointed out that > > the members I add to HeapScanDescData in this patch push rs_cindex and > > rs_ntuples to another cacheline and introduce a 4-byte hole. Attached > > v4's HeapScanDescData is as well-packed as on master and its members > > are reordered so that rs_cindex and rs_ntuples are back on the second > > cacheline of the struct's data. > > I did some additional profiling and realized that dropping the > unlikely() from the places we check rs_inited frequently was negatively > impacting performance. v5 adds those back and also makes a few other > very minor cleanups. > > Note that this patch set has a not yet released version of Thomas > Munro's Streaming Read API with a new ramp-up logic which seems to fix a > performance issue I saw with my test case when all of the sequential > scan's blocks are in shared buffers. Once he sends the official new > version, I will rebase this and point to his explanation in that thread. Attached v6 has the version of the streaming read API mentioned here [1]. This resolved the fully-in-shared-buffers regressions investigated in that thread by Andres, Bilal, and Thomas. 
The one outstanding item for the sequential scan streaming read user is deciding how the BAS_BULKREAD buffer access strategy should interact with the streaming read infrastructure. We discussed a bit off-list, and it seems clear that the ring must be at least as large as io_combine_limit. This should be no problem for BAS_BULKREAD because its ring is 16 MB. The question is whether or not we need to do anything right now to ensure there aren't adverse interactions between io_combine_limit, max_ios, and the buffer access strategy ring buffer size. - Melanie [1] https://www.postgresql.org/message-id/CA%2BhUKGJTwrS7F%3DuJPx3SeigMiQiW%2BLJaOkjGyZdCntwyMR%3DuAw%40mail.gmail.com
Attachment
- v6-0001-Split-heapgetpage-into-two-parts.patch
- v6-0002-Replace-blocks-with-buffers-in-heapgettup-control.patch
- v6-0004-Provide-API-for-streaming-relation-data.patch
- v6-0003-Provide-vectored-variant-of-ReadBuffer.patch
- v6-0005-Add-read_stream_reset.patch
- v6-0006-Sequential-scans-and-TID-range-scans-stream-reads.patch
On Wed, Mar 27, 2024 at 08:47:03PM -0400, Melanie Plageman wrote: > On Fri, Mar 8, 2024 at 4:56 PM Melanie Plageman > <melanieplageman@gmail.com> wrote: > > > > On Sat, Mar 02, 2024 at 06:07:48PM -0500, Melanie Plageman wrote: > > > On Wed, Feb 28, 2024 at 12:30 PM Melanie Plageman > > > <melanieplageman@gmail.com> wrote: > > > > > > > > On Mon, Feb 26, 2024 at 03:56:57PM -0500, Melanie Plageman wrote: > > > > > On Mon, Feb 19, 2024 at 6:05 PM Melanie Plageman > > > > > <melanieplageman@gmail.com> wrote: > > > > > > > > > > > > On Mon, Jan 29, 2024 at 4:17 PM Melanie Plageman > > > > > > <melanieplageman@gmail.com> wrote: > > > > > > > > > > > > > > There is an outstanding question about where to allocate the > > > > > > > PgStreamingRead object for sequential scans > > > > > > > > > > > > I've written three alternative implementations of the actual streaming > > > > > > read user for sequential scan which handle the question of where to > > > > > > allocate the streaming read object and how to handle changing scan > > > > > > direction in different ways. > > > > > > > > > > > > Option A) https://github.com/melanieplageman/postgres/tree/seqscan_pgsr_initscan_allocation > > > > > > - Allocates the streaming read object in initscan(). Since we do not > > > > > > know the scan direction at this time, if the scan ends up not being a > > > > > > forwards scan, the streaming read object must later be freed -- so > > > > > > this will sometimes allocate a streaming read object it never uses. > > > > > > - Only supports ForwardScanDirection and once the scan direction > > > > > > changes, streaming is never supported again -- even if we return to > > > > > > ForwardScanDirection > > > > > > - Must maintain a "fallback" codepath that does not use the streaming read API > > > > > > > > > > Attached is a version of this patch which implements a "reset" > > > > > function for the streaming read API which should be cheaper than the > > > > > full pg_streaming_read_free() on rescan. This can easily be ported to > > > > > work on any of my proposed implementations (A/B/C). I implemented it > > > > > on A as an example. > > > > > > > > Attached is the latest version of this patchset -- rebased in light of > > > > Thomas' updatees to the streaming read API [1]. I have chosen the > > > > approach I think we should go with. It is a hybrid of my previously > > > > proposed approaches. > > > > > > While investigating some performance concerns, Andres pointed out that > > > the members I add to HeapScanDescData in this patch push rs_cindex and > > > rs_ntuples to another cacheline and introduce a 4-byte hole. Attached > > > v4's HeapScanDescData is as well-packed as on master and its members > > > are reordered so that rs_cindex and rs_ntuples are back on the second > > > cacheline of the struct's data. > > > > I did some additional profiling and realized that dropping the > > unlikely() from the places we check rs_inited frequently was negatively > > impacting performance. v5 adds those back and also makes a few other > > very minor cleanups. > > > > Note that this patch set has a not yet released version of Thomas > > Munro's Streaming Read API with a new ramp-up logic which seems to fix a > > performance issue I saw with my test case when all of the sequential > > scan's blocks are in shared buffers. Once he sends the official new > > version, I will rebase this and point to his explanation in that thread. > > Attached v6 has the version of the streaming read API mentioned here > [1]. 
This resolved the fully-in-shared-buffers regressions > investigated in that thread by Andres, Bilal, and Thomas. Attached v7 has version 14 of the streaming read API as well as a few small tweaks to comments and code. I noticed that 0001 in the set posed a small regression from master for a sequential scan of a relation already in shared buffers. While investigating this, I saw that heapfetchbuf() was still not being inlined (compiled at -O2) and when I promoted heapfetchbuf() from static inline to static pg_attribute_always_inline, most of the very small regression I saw went away. I don't know if I squashed the issue entirely, though. - Melanie
Attachment
On 01/04/2024 22:58, Melanie Plageman wrote:
> Attached v7 has version 14 of the streaming read API as well as a few
> small tweaks to comments and code.

I saw benchmarks in this thread to show that there's no regression when the data is in cache, but I didn't see any benchmarks demonstrating the benefit of this. So I ran this quick test:

-- create a ~1 GB table with only 1 row per page
CREATE TABLE giga (i int, filler text) with (fillfactor=10);
insert into giga select g, repeat('x', 900) from generate_series(1, 140000) g;
vacuum freeze giga;

\timing on
select count(*) from giga;

The SELECT takes about 390 ms on 'master', and 230 ms with the patch.

This is pretty much the best case for this patch, real world gains will be much smaller. Nevertheless, nice speedup!

-- 
Heikki Linnakangas
Neon (https://neon.tech)
On Tue, Apr 2, 2024 at 1:10 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>
> On 01/04/2024 22:58, Melanie Plageman wrote:
> > Attached v7 has version 14 of the streaming read API as well as a few
> > small tweaks to comments and code.
>
> I saw benchmarks in this thread to show that there's no regression when
> the data is in cache, but I didn't see any benchmarks demonstrating the
> benefit of this. So I ran this quick test:

Good point! It would be good to show why we would actually want this patch. Attached v8 is rebased over current master (which now has the streaming read API).

On the topic of the BAS_BULKREAD buffer access strategy, I think the least we could do is add an assert like this to read_stream_begin_relation() after calculating max_pinned_buffers:

    Assert(GetAccessStrategyBufferCount(strategy) > max_pinned_buffers);

Perhaps we should do more? I think with a ring size of 16 MB, large SELECTs are safe for now. But I know future developers will make changes and it would be good not to assume they will understand that pinning more buffers than the size of the ring effectively invalidates the ring.

- Melanie
Attachment
On Thu, Apr 4, 2024 at 6:03 AM Melanie Plageman <melanieplageman@gmail.com> wrote:
> On Tue, Apr 2, 2024 at 1:10 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> > On 01/04/2024 22:58, Melanie Plageman wrote:
> > > Attached v7 has version 14 of the streaming read API as well as a few
> > > small tweaks to comments and code.
> >
> > I saw benchmarks in this thread to show that there's no regression when
> > the data is in cache, but I didn't see any benchmarks demonstrating the
> > benefit of this. So I ran this quick test:
>
> Good point! It would be good to show why we would actually want this
> patch. Attached v8 is rebased over current master (which now has the
> streaming read API).

Anecdotally by all reports I've seen so far, all-in-cache seems to be consistently a touch faster than master if anything, for streaming seq scan and streaming ANALYZE. That's great!, but it doesn't seem to be due to intentional changes. No efficiency is coming from batching anything. Perhaps it has to do with CPU pipelining effects: though it's doing the same work as ReadBuffer()-when-it-gets-a-hit, the work itself is cut up into stages in a kind of pipeline: read_stream_next_buffer() chooses page n + 2, pins page n + 1 and returns page n each time you call it, so maybe we get more CPU parallelism due to spreading the data dependencies out? (Makes me wonder what happens if you insert a memory prefetch for the page header into that production line, and if there are more opportunities to spread dependencies out eg hashing the buffer tag ahead of time.)

> On the topic of BAS_BULKREAD buffer access strategy, I think the least
> we could do is add an assert like this to read_stream_begin_relation()
> after calculating max_pinned_buffers.
>
> Assert(GetAccessStrategyBufferCount(strategy) > max_pinned_buffers);
>
> Perhaps we should do more? I think with a ring size of 16 MB, large
> SELECTs are safe for now. But I know future developers will make
> changes and it would be good not to assume they will understand that
> pinning more buffers than the size of the ring effectively invalidates
> the ring.

Yeah I think we should clamp max_pinned_buffers if we see a strategy. What do you think about:

    if (strategy)
    {
        int strategy_buffers = GetAccessStrategyBufferCount(strategy);

        max_pinned_buffers = Min(strategy_buffers / 2, max_pinned_buffers);
    }

I just don't know where to get that '2'. The idea would be to hopefully never actually be constrained by it in practice, except in tiny/toy setups, so we can't go too wrong with our number '2' there.

Then we should increase the default ring sizes for BAS_BULKREAD and BAS_VACUUM to make the previous statement true. The size of main memory and L2 cache have increased dramatically since those strategies were invented. I think we should at least double them, and more likely quadruple them. I realise you already made them configurable per command in commit 1cbbee033, but I mean the hard coded default 256 in freelist.c. It's not only to get more space for our prefetching plans, it's also to give the system more chance of flushing WAL asynchronously/in some other backend before you crash into dirty data; as you discovered, prefetching accidentally makes that effect worse in a BAS_VACUUM strategy, by taking away space that is effectively deferring WAL flushes, so I'd at least like to double the size if we use "/ 2" above.
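As a worked example of that clamp (assuming the default 8kB block size): a 256kB ring is 32 buffers, so strategy_buffers / 2 would cap the stream at 16 pinned buffers, while a 2MB ring (256 buffers) would allow up to 128 -- which is why the clamp and the ring-size increases discussed here go hand in hand.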
On Wed, Apr 3, 2024 at 4:28 PM Thomas Munro <thomas.munro@gmail.com> wrote: > > On Thu, Apr 4, 2024 at 6:03 AM Melanie Plageman > <melanieplageman@gmail.com> wrote: > > On the topic of BAS_BULKREAD buffer access strategy, I think the least > > we could do is add an assert like this to read_stream_begin_relation() > > after calculating max_pinned_buffers. > > > > Assert(GetAccessStrategyBufferCount(strategy) > max_pinned_buffers); > > > > Perhaps we should do more? I think with a ring size of 16 MB, large > > SELECTs are safe for now. But I know future developers will make > > changes and it would be good not to assume they will understand that > > pinning more buffers than the size of the ring effectively invalidates > > the ring. > > Yeah I think we should clamp max_pinned_buffers if we see a strategy. > What do you think about: > > if (strategy) > { > int strategy_buffers = GetAccessStrategyBufferCount(strategy); > max_pinned_buffers = Min(strategy_buffers / 2, max_pinned_buffers); > } > > I just don't know where to get that '2'. The idea would be to > hopefully never actually be constrained by it in practice, except in > tiny/toy setups, so we can't go too wrong with our number '2' there. Yea, I don't actually understand why not just use strategy_buffers - 1 or something. 1/2 seems like a big limiting factor for those strategies with small rings. I don't really think it will come up for SELECT queries since they rely on readahead and not prefetching. It does seem like it could easily come up for analyze. But I am on board with the idea of clamping for sequential scan/TID range scan. For vacuum, we might have to think harder. If the user specifies buffer_usage_limit and io_combine_limit and they are never reaching IOs of io_combine_limit because of their buffer_usage_limit value, then we should probably inform them. - Melanie
Hi,

On 2024-04-04 09:27:35 +1300, Thomas Munro wrote:
> Anecdotally by all reports I've seen so far, all-in-cache seems to be
> consistently a touch faster than master if anything, for streaming seq
> scan and streaming ANALYZE. That's great!, but it doesn't seem to be
> due to intentional changes. No efficiency is coming from batching
> anything. Perhaps it has to do with CPU pipelining effects: though
> it's doing the same work as ReadBuffer()-when-it-gets-a-hit, the work
> itself is cut up into stages in a kind of pipeline:
> read_stream_next_buffer() chooses page n + 2, pins page n + 1 and
> returns page n each time you call it, so maybe we get more CPU
> parallelism due to spreading the data dependencies out?

Another theory is that it's due to the plain ReadBuffer() path needing to do RelationGetSmgr(reln) on every call, whereas the streaming read path doesn't need to.

> > On the topic of BAS_BULKREAD buffer access strategy, I think the least
> > we could do is add an assert like this to read_stream_begin_relation()
> > after calculating max_pinned_buffers.
> >
> > Assert(GetAccessStrategyBufferCount(strategy) > max_pinned_buffers);
> >
> > Perhaps we should do more? I think with a ring size of 16 MB, large
> > SELECTs are safe for now. But I know future developers will make
> > changes and it would be good not to assume they will understand that
> > pinning more buffers than the size of the ring effectively invalidates
> > the ring.
>
> Yeah I think we should clamp max_pinned_buffers if we see a strategy.
> What do you think about:
>
>     if (strategy)
>     {
>         int strategy_buffers = GetAccessStrategyBufferCount(strategy);
>
>         max_pinned_buffers = Min(strategy_buffers / 2, max_pinned_buffers);
>     }
>
> I just don't know where to get that '2'. The idea would be to
> hopefully never actually be constrained by it in practice, except in
> tiny/toy setups, so we can't go too wrong with our number '2' there.

The / 2 is to avoid causing unnecessarily frequent WAL flushes, right? If so, should we apply that only if the strategy doesn't use the StrategyRejectBuffer() logic?

I think it's fine to add a handwavy justification for the 2, saying that we want to balance readahead distance and reducing WAL write frequency, and that at some point more sophisticated logic will be needed.

> Then we should increase the default ring sizes for BAS_BULKREAD and
> BAS_VACUUM to make the previous statement true.

I'm not sure it's right tying them together. The concerns for both are fairly different.

> The size of main memory and L2 cache have increased dramatically since those
> strategies were invented. I think we should at least double them, and more
> likely quadruple them. I realise you already made them configurable per
> command in commit 1cbbee033, but I mean the hard coded default 256 in
> freelist.c. It's not only to get more space for our prefetching plans, it's
> also to give the system more chance of flushing WAL asynchronously/in some
> other backend before you crash into dirty data; as you discovered,
> prefetching accidentally makes that effect worse in a BAS_VACUUM strategy,
> by taking away space that is effectively deferring WAL flushes, so I'd at
> least like to double the size if we use "/ 2" above.

I think for VACUUM we should probably go a bit further. There's no comparable L1/L2 issue, because the per-buffer processing + WAL insertion is a lot more expensive, compared to a seqscan. I'd go for at least 4x-8x.

Greetings,

Andres Freund
On Thu, 4 Apr 2024 at 06:03, Melanie Plageman <melanieplageman@gmail.com> wrote:
> Attached v8 is rebased over current master (which now has the
> streaming read API).

I've looked at the v8-0001 patch.

I wasn't too keen on heapbuildvis() as a function name for an external function. Since it also does pruning work, it seemed weird to make it sound like it only did visibility work. Per our offline discussion about names, I've changed it to what you suggested, which is heap_page_prep().

Aside from that, there was an outdated comment: "In pageatatime mode, heapgetpage() already did visibility checks," which was no longer true as that's done in heapbuildvis() (now heap_page_prep()).

I also did a round of comment adjustments as there were a few things I didn't like, e.g:

+ * heapfetchbuf - subroutine for heapgettup()

This is also used in heapgettup_pagemode(), so I thought it was a bad idea for a function to list places it thinks it's being called. I also thought it redundant to write "This routine" in the function head comment. I think "this routine" is implied by the context. I ended up with:

/*
 * heapfetchbuf - read and pin the given MAIN_FORKNUM block number.
 *
 * Read the specified block of the scan relation into a buffer and pin that
 * buffer before saving it in the scan descriptor.
 */

I'm happy with your changes to heapam_scan_sample_next_block(). I did adjust the comment above CHECK_FOR_INTERRUPTS() so it was effectively the same as the seqscan version, just with "seqscan" swapped for "sample scan".

The only other thing I adjusted there was to use "blockno" in some places where you were using hscan->rs_cblock. These all come after the "hscan->rs_cblock = blockno;" line. My thoughts here are that this is more likely to avoid reading the value from the struct again if the compiler isn't happy that the two values are still equivalent for some reason. Even if the compiler is happy today, it would only take a code change to pass hscan to some external function for the compiler to become uncertain whether that function has adjusted rs_cblock, and to go with the safe option of pulling the value out of the struct again, which is a little more expensive as it requires some maths to figure out the offset.

I've attached v9-0001 and a delta of just my changes from v8.

David
Attachment
On Wed, Apr 3, 2024 at 9:08 PM David Rowley <dgrowleyml@gmail.com> wrote: > > On Thu, 4 Apr 2024 at 06:03, Melanie Plageman <melanieplageman@gmail.com> wrote: > > Attached v8 is rebased over current master (which now has the > > streaming read API). > > I've looked at the v8-0001 patch. Thanks for taking a look! > I wasn't too keen on heapbuildvis() as a function name for an external > function. Since it also does pruning work, it seemed weird to make it > sound like it only did visibility work. Per our offline discussion > about names, I've changed it to what you suggested which is > heap_page_prep(). Looking at it in the code, I am wondering if we should call heap_page_prep() heap_scan_page_prep(). Just wondering if it is clear that it is prepping a page to be scanned. You choose whatever you think is best. > Aside from that, there was an outdated comment: "In pageatatime mode, > heapgetpage() already did visibility checks," which was no longer true > as that's done in heapbuildvis() (now heap_page_prep()). > > I also did a round of comment adjustments as there were a few things I > didn't like, e.g: > > + * heapfetchbuf - subroutine for heapgettup() > > This is also used in heapgettup_pagemode(), so I thought it was a bad > idea for a function to list places it thinks it's being called. I > also thought it redundant to write "This routine" in the function head > comment. I think "this routine" is implied by the context. I ended up > with: > > /* > * heapfetchbuf - read and pin the given MAIN_FORKNUM block number. > * > * Read the specified block of the scan relation into a buffer and pin that > * buffer before saving it in the scan descriptor. > */ > > I'm happy with your changes to heapam_scan_sample_next_block(). I did > adjust the comment above CHECK_FOR_INTERRUPTS() so it was effectively > the same as the seqscan version, just with "seqscan" swapped for > "sample scan". That all is fine with me. > The only other thing I adjusted there was to use "blockno" in some > places where you were using hscan->rs_cblock. These all come after > the "hscan->rs_cblock = blockno;" line. My thoughts here are that this > is more likely to avoid reading the value from the struct again if the > compiler isn't happy that the two values are still equivalent for some > reason. Even if the compiler is happy today, it would only take a > code change to pass hscan to some external function for the compiler > to perhaps be uncertain if that function has made an adjustment to > rs_cblock and go with the safe option of pulling the value out the > struct again which is a little more expensive as it requires some > maths to figure out the offset. > > I've attached v9-0001 and a delta of just my changes from v8. All sounds good and LGTM - Melanie
On Thu, 4 Apr 2024 at 14:38, Melanie Plageman <melanieplageman@gmail.com> wrote: > Looking at it in the code, I am wondering if we should call > heap_page_prep() heap_scan_page_prep(). Just wondering if it is clear > that it is prepping a page to be scanned. You choose whatever you > think is best. I ended up calling it heap_prepare_pagescan() as I started to think prep/prepare should come first. I don't think it's perfect as the intended meaning is heap_prepare_page_for_scanning_in_pagemode(), but that's obviously too long. I've pushed the v9-0001 with that rename done. David
On Thu, 4 Apr 2024 at 16:45, David Rowley <dgrowleyml@gmail.com> wrote:
> I've pushed the v9-0001 with that rename done.

I've now just pushed the 0002 patch with some revisions:

1. The function declarations you added for heapgettup_advance_block() and heapgettup_initial_block() didn't match the properties of their definitions. You'd declared both of these static inline but neither of these were.

2. I felt inclined to rename heapfetchbuf() to heapfetchnextbuf() as that's effectively what it does with v8-0002, however, that's just too many words all shoved together, so I renamed it to heap_fetch_next_buffer().

3. I changed heapgettup_initial_block() to pg_noinline as it both makes more sense to have this out of line and it also fixed a small performance regression.

Looks like I also failed to grep for all the remaining instances of "heapgetpage" in 44086b097. Those are now fixed by 3a4a3537a.

I also rebased the 0003 patch which I've attached as a raw patch.

FWIW, using Heikki's test in [1] with a pg_prewarm each time after restarting the instance. No parallel aggregate.

drowley@amd3990x:~$ cat bench.sql
select count(*) from giga;

drowley@amd3990x:~$ pgbench -n -f bench.sql -T 120 postgres | grep latency

44086b097~1
latency average = 34.323 ms
latency average = 34.332 ms

44086b097
latency average = 34.811 ms
latency average = 34.862 ms

3a4a3537a
latency average = 34.497 ms
latency average = 34.538 ms

3a4a3537a + read_stream_for_seqscans.patch
latency average = 40.923 ms
latency average = 41.415 ms

i.e. no meaningful change from the refactor, but a regression from a cached workload that changes the page often without doing much work in between with the read stream patch.

I'm happy to run further benchmarks, but for the remainder of the committer responsibility for the next patches, I'm going to leave that to Thomas.

David

[1] https://www.postgresql.org/message-id/3b0f3701-addd-4629-9257-cf28e1a6e6a1@iki.fi
Attachment
On Thu, Apr 4, 2024 at 11:13 AM Andres Freund <andres@anarazel.de> wrote: > The / 2 is to avoid causing unnecessarily frequent WAL flushes, right? If so, > should we apply that only if the ring the strategy doesn't use the > StrategyRejectBuffer() logic? Hmm, I don't really know, but that sounds plausible. What do you think about the attached? > I think for VACUUM we should probably go a bit further. There's no comparable > L1/L2 issue, because the per-buffer processing + WAL insertion is a lot more > expensive, compared to a seqscan. I'd go or at lest 4x-8x. Alright what about this?
Attachment
On Thu, Apr 4, 2024 at 10:31 PM Thomas Munro <thomas.munro@gmail.com> wrote: > Alright what about this? Forgot to git add a change, new version.
Attachment
On Thu, Apr 4, 2024 at 8:02 PM David Rowley <dgrowleyml@gmail.com> wrote:
> 3a4a3537a
> latency average = 34.497 ms
> latency average = 34.538 ms
>
> 3a4a3537a + read_stream_for_seqscans.patch
> latency average = 40.923 ms
> latency average = 41.415 ms
>
> i.e. no meaningful change from the refactor, but a regression from a
> cached workload that changes the page often without doing much work in
> between with the read stream patch.

I ran Heikki's test except I ran the "insert" 4 times to get a table of 4376MB according to \d+. On my random cloud ARM server (SB=8GB, huge pages, parallelism disabled), I see a speedup 1290ms -> 1046ms when the data is in Linux cache and PG is not prewarmed, roughly as he reported. Good.

If I pg_prewarm first, I see that slowdown 260ms -> 294ms. Trying things out to see what works, I got that down to 243ms (ie beat master) by inserting a memory prefetch:

--- a/src/backend/storage/aio/read_stream.c
+++ b/src/backend/storage/aio/read_stream.c
@@ -757,6 +757,8 @@ read_stream_next_buffer(ReadStream *stream, void **per_buffer_data)
 	/* Prepare for the next call. */
 	read_stream_look_ahead(stream, false);
 
+	__builtin_prefetch(BufferGetPage(stream->buffers[stream->oldest_buffer_index]));

Maybe that's a solution to a different problem that just happens to more than make up the difference in this case, and it may be questionable whether that cache line will survive long enough to help you, but this one-tuple-per-page test likes it... Hmm, to get a more realistic table than the one-tuple-per-page one, I tried doubling a tenk1 table until it reached 2759MB and then I got a post-prewarm regression 702ms -> 721ms, and again I can beat master by memory prefetching: 689ms.

Annoyingly, with the same table I see no difference between the actual pg_prewarm('giga') time: around 155ms for both. pg_prewarm is able to use the 'fast path' I made pretty much just to be able to minimise regression in that (probably stupid) all-cached test that doesn't even look at the page contents. Unfortunately seq scan can't use it because it has per-buffer data, which is one of the things it can't do (because of a space management issue). Maybe I should try to find a way to fix that.

> I'm happy to run further benchmarks, but for the remainder of the
> committer responsibility for the next patches, I'm going to leave that
> to Thomas.

Thanks!
On Thu, Apr 4, 2024 at 3:02 AM David Rowley <dgrowleyml@gmail.com> wrote: > > On Thu, 4 Apr 2024 at 16:45, David Rowley <dgrowleyml@gmail.com> wrote: > > I've pushed the v9-0001 with that rename done. > > I've now just pushed the 0002 patch with some revisions: Thanks! > 1. The function declarations you added for heapgettup_advance_block() > and heapgettup_initial_block() didn't match the properties of their > definitions. You'd declared both of these static inline but neither > of these were. Ah, they needed to be defined static but I intentionally left off the inline in the definition and only put it in the forward declaration because I thought that was correct. Anyway, I'm fine with how you did it in the end. > 2. I felt inclined to rename heapfetchbuf() to heapfetchnextbuf() as > that's effectively what it does with v8-0002, however, that's just too > many words all shoved together, so I renamed it to > heap_fetch_next_buffer(). Sounds good. > 3. I changed heapgettup_initial_block() to pg_noinline as it both > makes more sense to have this out of line and it also fixed a small > performance regression. Ah, I guess it is in the unlikely path. I often forget that noinline exists. It's interesting that that made a noticeable difference since it is a pretty short function. Thanks for taking such a close look! > Looks like I also failed to grep for all the remaining instances of > "heapgetpage" in 44086b097. Those are now fixed by 3a4a3537a. > > I also rebased the 0003 patch which I've attached as a raw patch. Thanks! > FWIW, using Heikki's test in [1] with a pg_prewarm each time after > restarting the instance. No parallel aggregate. > > drowley@amd3990x:~$ cat bench.sql > select count(*) from giga; > > drowley@amd3990x:~$ pgbench -n -f bench.sql -T 120 postgres | grep latency > > 44086b097~1 > latency average = 34.323 ms > latency average = 34.332 ms > > 44086b097 > latency average = 34.811 ms > latency average = 34.862 ms > > 3a4a3537a > latency average = 34.497 ms > latency average = 34.538 ms > > 3a4a3537a + read_stream_for_seqscans.patch > latency average = 40.923 ms > latency average = 41.415 ms > > i.e. no meaningful change from the refactor, but a regression from a > cached workload that changes the page often without doing much work in > between with the read stread patch. Cool. Thanks for testing this out. Sounds like Thomas did some analysis of how to resolve this for the streaming read user, and I will do some too. - Melanie
On Thu, Apr 4, 2024 at 7:45 AM Thomas Munro <thomas.munro@gmail.com> wrote:
>
> On Thu, Apr 4, 2024 at 8:02 PM David Rowley <dgrowleyml@gmail.com> wrote:
> > 3a4a3537a
> > latency average = 34.497 ms
> > latency average = 34.538 ms
> >
> > 3a4a3537a + read_stream_for_seqscans.patch
> > latency average = 40.923 ms
> > latency average = 41.415 ms
> >
> > i.e. no meaningful change from the refactor, but a regression from a
> > cached workload that changes the page often without doing much work in
> > between with the read stream patch.
>
> I ran Heikki's test except I ran the "insert" 4 times to get a table
> of 4376MB according to \d+. On my random cloud ARM server (SB=8GB,
> huge pages, parallelism disabled), I see a speedup 1290ms -> 1046ms
> when the data is in Linux cache and PG is not prewarmed, roughly as he
> reported. Good.
>
> If I pg_prewarm first, I see that slowdown 260ms -> 294ms. Trying
> things out to see what works, I got that down to 243ms (ie beat
> master) by inserting a memory prefetch:
>
> --- a/src/backend/storage/aio/read_stream.c
> +++ b/src/backend/storage/aio/read_stream.c
> @@ -757,6 +757,8 @@ read_stream_next_buffer(ReadStream *stream, void **per_buffer_data)
>  	/* Prepare for the next call. */
>  	read_stream_look_ahead(stream, false);
>
> +	__builtin_prefetch(BufferGetPage(stream->buffers[stream->oldest_buffer_index]));
>
> Maybe that's a solution to a different problem that just happens to
> more than make up the difference in this case, and it may be
> questionable whether that cache line will survive long enough to help
> you, but this one-tuple-per-page test likes it... Hmm, to get a more
> realistic table than the one-tuple-per-page one, I tried doubling a
> tenk1 table until it reached 2759MB and then I got a post-prewarm
> regression 702ms -> 721ms, and again I can beat master by memory
> prefetching: 689ms.
>
> Annoyingly, with the same table I see no difference between the actual
> pg_prewarm('giga') time: around 155ms for both. pg_prewarm is able to
> use the 'fast path' I made pretty much just to be able to minimise
> regression in that (probably stupid) all-cached test that doesn't even
> look at the page contents. Unfortunately seq scan can't use it
> because it has per-buffer data, which is one of the things it can't do
> (because of a space management issue). Maybe I should try to find a
> way to fix that.

So, sequential scan does not have per-buffer data. I did some logging and the reason most fully-in-SB sequential scans don't use the fast path is because read_stream->pending_read_nblocks is always 0.

When read_stream->distance stays 1 (expected for all-in-SB as it is initialized to 1 and we don't want distance to ramp up), read_stream_look_ahead() never increments read_stream->pending_read_nblocks, because it sets it to 1 the first time it is called and then the conditions of the while loop are not met again:

    while (stream->ios_in_progress < stream->max_ios &&
           stream->pinned_buffers + stream->pending_read_nblocks < stream->distance)

distance is 1, pending_read_nblocks is 1, thus we only loop once and don't increment pending_read_nblocks.

prewarm is only able to use the fast path because it passes READ_STREAM_FULL and thus initializes read_stream->distance to a higher initial value.

I added some logging to see if any of the sequential scans in the regression suite used the fast path. The one example I see of the fast path being used is a temp table IO stats test in src/test/regress/sql/stats.sql.
I didn't check exactly what conditions led it to do this. But we probably want seq scans which are all in SB to use the fast path. - Melanie
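Spelling out the arithmetic in that loop condition: with distance = 1, the first pass through the loop sets pending_read_nblocks to 1, and on the next test pinned_buffers + pending_read_nblocks is already 1, which is not less than distance (1 < 1 is false), so the loop body never runs again and the stream keeps issuing single-block reads without ever building up a larger pending read.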
Hi,

On 2024-04-04 22:37:39 +1300, Thomas Munro wrote:
> On Thu, Apr 4, 2024 at 10:31 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> > Alright what about this?

I think it's probably worth adding a bit more of the commit message to the function comment. Yes, there's a bit in one of the return branches, but that's not what you're going to look at when just checking what the function does.

> From e610bc78a2e3ecee50bd897e35084469d00bbac5 Mon Sep 17 00:00:00 2001
> From: Thomas Munro <thomas.munro@gmail.com>
> Date: Thu, 4 Apr 2024 21:11:06 +1300
> Subject: [PATCH v2 2/2] Increase default vacuum_buffer_usage_limit to 2MB.
>
> The BAS_VACUUM ring size has been 256kB since commit d526575f. Commit
> 1cbbee03 made it configurable but retained the traditional default.
> The correct default size has been debated for years, but 256kB is
> certainly very small. VACUUM soon needs to write back data it dirtied
> only 32 blocks ago, which usually requires flushing the WAL. New
> experiments in prefetching pages for VACUUM exacerbated the problem by
> crashing into dirty data even sooner. Let's make the default 2MB.
> That's 1.5% of the default toy buffer pool size, and 0.2% of 1GB, which
> would be considered a small shared_buffers setting for a real system
> these days. Users are still free to set the GUC to a different value.

+1. Independent of any other changes, this improves the default vacuum performance substantially.

We might want to dynamically size the default at some point - but we probably should overhaul the infrastructure first...

Greetings,

Andres Freund
On Fri, Apr 5, 2024 at 4:20 AM Melanie Plageman <melanieplageman@gmail.com> wrote: > So, sequential scan does not have per-buffer data. I did some logging > and the reason most fully-in-SB sequential scans don't use the fast > path is because read_stream->pending_read_nblocks is always 0. Hnghghghgh... right, sorry I guessed the wrong reason, it turns out that I made a fast path just a little too specialised for pg_prewarm. Working on it...
Yeah, I plead benchmarking myopia, sorry. The fastpath as committed is
only reached when distance goes 2->1, as pg_prewarm does. Oops. With
the attached minor rearrangement, it works fine. I also poked some more
at that memory prefetcher. Here are the numbers I got on a desktop
system (Intel i9-9900 @ 3.1GHz, Linux 6.1, turbo disabled, cpufreq
governor=performance, 2MB huge pages, SB=8GB, consumer NVMe, GCC -O3).

create table t (i int, filler text) with (fillfactor=10);
insert into t
select g, repeat('x', 900) from generate_series(1, 560000) g;
vacuum freeze t;
set max_parallel_workers_per_gather = 0;

select count(*) from t;

cold = must be read from actual disk (Linux drop_caches)
warm = read from linux page cache
hot = already in pg cache via pg_prewarm

                                   cold     warm    hot
master                             2479ms   886ms   200ms
seqscan                            2498ms   716ms   211ms  <-- regression
seqscan + fastpath                 2493ms   711ms   200ms  <-- fixed, I think?
seqscan + memprefetch              2499ms   716ms   182ms
seqscan + fastpath + memprefetch   2505ms   710ms   170ms  <-- \O/

Cold has no difference. That's just my disk demonstrating Linux RA at
128kB (default); random I/O is obviously a more interesting story. It's
consistently a smidgen faster with Linux RA set to 2MB (as in blockdev
--setra 4096 /dev/nvmeXXX), and I believe this effect probably also
increases on fancier faster storage than what I have on hand:

                                   cold
master                             1775ms
seqscan + fastpath + memprefetch   1700ms

Warm is faster as expected (fewer system calls schlepping data
kernel->userspace).

The interesting column is hot. The 200ms->211ms regression is due to
the extra bookkeeping in the slow path. The rejiggered fastpath code
fixes it for me, or maybe sometimes shows an extra 1ms. Phew. Can you
reproduce that?

The memory prefetching trick, on top of that, seems to be a good
optimisation so far. Note that that's not an entirely independent
trick: it's something we can only do now that we can see into the
future; it's the next level up of prefetching, worth doing around 60ns
before you need the data, I guess. Who knows how thrashed the cache
might be before the caller gets around to accessing that page, but
there doesn't seem to be much of a cost or downside to this bet. We
know there are many more opportunities like that[1], but I don't want
to second-guess the AM here; I'm just betting that the caller is going
to look at the header.

Unfortunately there seems to be a subtle bug hiding somewhere in here,
visible on macOS on CI. Looking into that, going to find my Mac...

[1] https://www.postgresql.org/message-id/flat/CAApHDvpTRx7hqFZGiZJ%3Dd9JN4h1tzJ2%3Dxt7bM-9XRmpVj63psQ%40mail.gmail.com
Attachment
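For readers unfamiliar with the memory-prefetch trick being measured above, here is a self-contained sketch of the general idea using the plain GCC/Clang __builtin_prefetch intrinsic on a toy array. The actual change quoted earlier in the thread applies the hint to the page header of a buffer the caller is expected to touch soon, inside read_stream_next_buffer, and wraps the intrinsic in a macro; treat this purely as an illustration of the technique:

    #include <stdio.h>
    #include <stddef.h>

    #define PAGE_SIZE   8192
    #define NPAGES      4

    static char pages[NPAAGES > 0 ? NPAGES : 1][PAGE_SIZE];

    /* pretend to examine a page header */
    static long
    process_page(const char *page)
    {
        long    sum = 0;

        for (size_t i = 0; i < 64; i++)
            sum += page[i];
        return sum;
    }

    int
    main(void)
    {
        long    total = 0;

        for (int i = 0; i < NPAGES; i++)
        {
            /*
             * Hint to the CPU that the next page's first cache line will be
             * needed soon, then do the "real" work on the current page.
             */
            if (i + 1 < NPAGES)
                __builtin_prefetch(pages[i + 1]);
            total += process_page(pages[i]);
        }
        printf("%ld\n", total);
        return 0;
    }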
On Thu, Apr 4, 2024 at 12:39 PM Andres Freund <andres@anarazel.de> wrote: > > On 2024-04-04 22:37:39 +1300, Thomas Munro wrote: > > On Thu, Apr 4, 2024 at 10:31 PM Thomas Munro <thomas.munro@gmail.com> wrote: > > > Alright what about this? > > I think it's probably worth adding a bit more of the commit message to the > function comment. Yes, there's a bit in one of the return branches, but that's > not what you're going to look at when just checking what the function does. Agreed about the comment. I kept thinking that BAS_BULKREAD should maybe return nbuffers - 1, but I couldn't convince myself why. Otherwise v2-0001-Allow-BufferAccessStrategy-to-limit-pin-count LGTM. - Melanie
On Fri, Apr 5, 2024 at 12:15 AM Thomas Munro <thomas.munro@gmail.com> wrote: > > Yeah, I plead benchmarking myopia, sorry. The fastpath as committed > is only reached when distance goes 2->1, as pg_prewarm does. Oops. > With the attached minor rearrangement, it works fine. I also poked > some more at that memory prefetcher. Here are the numbers I got on a > desktop system (Intel i9-9900 @ 3.1GHz, Linux 6.1, turbo disabled, > cpufreq governor=performance, 2MB huge pages, SB=8GB, consumer NMVe, > GCC -O3). > > create table t (i int, filler text) with (fillfactor=10); > insert into t > select g, repeat('x', 900) from generate_series(1, 560000) g; > vacuum freeze t; > set max_parallel_workers_per_gather = 0; > > select count(*) from t; > > cold = must be read from actual disk (Linux drop_caches) > warm = read from linux page cache > hot = already in pg cache via pg_prewarm > > cold warm hot > master 2479ms 886ms 200ms > seqscan 2498ms 716ms 211ms <-- regression > seqscan + fastpath 2493ms 711ms 200ms <-- fixed, I think? > seqscan + memprefetch 2499ms 716ms 182ms > seqscan + fastpath + memprefetch 2505ms 710ms 170ms <-- \O/ > > Cold has no difference. That's just my disk demonstrating Linux RA at > 128kB (default); random I/O is obviously a more interesting story. > It's consistently a smidgen faster with Linux RA set to 2MB (as in > blockdev --setra 4096 /dev/nvmeXXX), and I believe this effect > probably also increases on fancier faster storage than what I have on > hand: > > cold > master 1775ms > seqscan + fastpath + memprefetch 1700ms > > Warm is faster as expected (fewer system calls schlepping data > kernel->userspace). > > The interesting column is hot. The 200ms->211ms regression is due to > the extra bookkeeping in the slow path. The rejiggered fastpath code > fixes it for me, or maybe sometimes shows an extra 1ms. Phew. Can > you reproduce that? I am able to reproduce the fast path solving the issue using Heikki's example here [1] but in shared buffers (hot). master: 25 ms stream read: 29 ms stream read + fast path: 25 ms I haven't looked into or reviewed the memory prefetching part. While reviewing 0002, I realized that I don't quite see how read_stream_get_block() will be used in the fastpath -- which it claims in its comments. read_stream_next_buffer() is the only caller of read_stream_look_ahead()->read_stream_get_block(), and if fast_path is true, read_stream_next_buffer() always returns before calling read_stream_look_ahead(). Maybe I am missing something. I see fast_path uses read_stream_fill_blocknums() to invoke the callback. Oh and why does READ_STREAM_DISABLE_FAST_PATH macro exist? Otherwise 0002 looks good to me. I haven't reviewed 0003 or 0004. I attached a new version (v11) because I noticed an outdated comment in my seq scan streaming read user patch (0001). The other patches in the set are untouched from your versions besides adding author/reviewer info in commit message for 0002. - Melanie [1] https://www.postgresql.org/message-id/3b0f3701-addd-4629-9257-cf28e1a6e6a1%40iki.fi
Attachment
On Sat, Apr 6, 2024 at 6:55 AM Melanie Plageman <melanieplageman@gmail.com> wrote: > On Fri, Apr 5, 2024 at 12:15 AM Thomas Munro <thomas.munro@gmail.com> wrote: > > The interesting column is hot. The 200ms->211ms regression is due to > > the extra bookkeeping in the slow path. The rejiggered fastpath code > > fixes it for me, or maybe sometimes shows an extra 1ms. Phew. Can > > you reproduce that? > > I am able to reproduce the fast path solving the issue using Heikki's > example here [1] but in shared buffers (hot). > > master: 25 ms > stream read: 29 ms > stream read + fast path: 25 ms Great, thanks. > I haven't looked into or reviewed the memory prefetching part. > > While reviewing 0002, I realized that I don't quite see how > read_stream_get_block() will be used in the fastpath -- which it > claims in its comments. Comments improved. > Oh and why does READ_STREAM_DISABLE_FAST_PATH macro exist? Just for testing purposes. Behaviour should be identical to external observers either way, it's just a hand-rolled specialisation for certain parameters, and it's useful to be able to verify that and measure the speedup. I think it's OK to leave a bit of developer/testing scaffolding in the tree if it's likely to be useful again and especially if like this case it doesn't create any dead code. (Perhaps in later work we might find the right way to get the compiler to do the specialisation? It's so simple though.) The occasional CI problem I mentioned turned out to be read_stream_reset() remembering a little too much state across rescans. Fixed. Thanks both for the feedback on the ring buffer tweaks. Comments updated. Here is the full stack of patches I would like to commit very soon, though I may leave the memory prefetching part out for a bit longer to see if I can find any downside.
Attachment
- v12-0001-Improve-read_stream.c-s-fast-path.patch
- v12-0002-Add-pg_prefetch_mem-macro-to-load-cache-lines.patch
- v12-0003-Prefetch-page-header-memory-when-streaming-relat.patch
- v12-0004-Allow-BufferAccessStrategy-to-limit-pin-count.patch
- v12-0005-Increase-default-vacuum_buffer_usage_limit-to-2M.patch
- v12-0006-Use-streaming-IO-in-heapam-sequential-and-TID-ra.patch
On Fri, Apr 5, 2024 at 7:28 PM Thomas Munro <thomas.munro@gmail.com> wrote: > > On Sat, Apr 6, 2024 at 6:55 AM Melanie Plageman > <melanieplageman@gmail.com> wrote: > > I haven't looked into or reviewed the memory prefetching part. > > > > While reviewing 0002, I realized that I don't quite see how > > read_stream_get_block() will be used in the fastpath -- which it > > claims in its comments. > > Comments improved. Ah, makes sense now. > The occasional CI problem I mentioned turned out to be > read_stream_reset() remembering a little too much state across > rescans. Fixed. Nice investigative work figuring this out. > Thanks both for the feedback on the ring buffer tweaks. Comments > updated. Here is the full stack of patches I would like to commit > very soon, though I may leave the memory prefetching part out for a > bit longer to see if I can find any downside. 0001 LGTM. I did not review 0002 or 0003. 0004 looks good except for one comment typo: /* * Tell call not to pin more than half the buffers in the ring. * This is a trade-off between look ahead distance and deferring * writeback and associated WAL traffic. */ call -> caller 0006, I noticed the commit message is missing an important comma: Instead of calling ReadBuffer() for each block heap sequential scans and TID range scans now use the streaming read API introduced in b5a9b18cd0. should be Instead of calling ReadBuffer() for each block, heap sequential scans and TID range scans now use the streaming read API introduced in b5a9b18cd0. - Melanie
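For what it's worth, the trade-off that comment describes can be modelled roughly like this. The helper names here are invented for illustration and the real patch may compute its limit differently; the point is only that the stream's look-ahead is clamped so it never pins more than half of the strategy ring, leaving the other half free for writeback and reuse:

    #include <stdio.h>

    /* hypothetical helper: cap pins at half the strategy ring */
    static int
    ring_pin_limit(int ring_nbuffers)
    {
        return ring_nbuffers > 1 ? ring_nbuffers / 2 : 1;
    }

    /* the effective look-ahead distance can never exceed the pin limit */
    static int
    effective_distance(int desired_distance, int ring_nbuffers)
    {
        int     limit = ring_pin_limit(ring_nbuffers);

        return desired_distance < limit ? desired_distance : limit;
    }

    int
    main(void)
    {
        /* a 2MB ring is 256 buffers of 8kB */
        printf("pin limit: %d\n", ring_pin_limit(256));
        printf("effective distance: %d\n", effective_distance(1000, 256));
        return 0;
    }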
I found a bug in read_stream.c that could be hit with Melanie's streaming seq scan patch with parallelism enabled and certain buffer pool conditions. Short version: there is an edge case where an "if" needed to be a "while", or we could lose a few blocks. Here's the fix for that, longer explanation in commit message.
Attachment
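The general shape of that class of bug, reduced to a standalone toy (this is not the read_stream.c code, just the pattern: when more than one item can become available at once, draining with "if" instead of "while" silently leaves some behind):

    #include <stdio.h>

    static int  available;          /* items ready to be consumed */

    static int
    drain_if(void)
    {
        int     consumed = 0;

        if (available > 0)          /* BUG: handles at most one item per call */
        {
            available--;
            consumed++;
        }
        return consumed;
    }

    static int
    drain_while(void)
    {
        int     consumed = 0;

        while (available > 0)       /* correct: keeps going until nothing is left */
        {
            available--;
            consumed++;
        }
        return consumed;
    }

    int
    main(void)
    {
        available = 3;
        printf("if:    consumed %d of 3\n", drain_if());       /* prints 1 */
        available = 3;
        printf("while: consumed %d of 3\n", drain_while());    /* prints 3 */
        return 0;
    }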
On Sat, Apr 6, 2024 at 9:25 PM Thomas Munro <thomas.munro@gmail.com> wrote: > > I found a bug in read_stream.c that could be hit with Melanie's > streaming seq scan patch with parallelism enabled and certain buffer > pool conditions. Short version: there is an edge case where an "if" > needed to be a "while", or we could lose a few blocks. Here's the fix > for that, longer explanation in commit message. Attached v13 0001 is your fix and 0002 is a new version of the sequential scan streaming read user. Off-list Andres mentioned that I really ought to separate the parallel and serial sequential scan users into two different callbacks. I've done that in the attached. It actually makes the code used by the callbacks nicer and more readable anyway (putting aside performance). I was able to measure a small performance difference as well. I've also added a few comments and improved existing comments. - Melanie
Attachment
On Sun, Apr 7, 2024 at 1:34 PM Melanie Plageman <melanieplageman@gmail.com> wrote:
> Attached v13 0001 is your fix and 0002 is a new version of the
> sequential scan streaming read user. Off-list Andres mentioned that I
> really ought to separate the parallel and serial sequential scan users
> into two different callbacks. I've done that in the attached. It
> actually makes the code used by the callbacks nicer and more readable
> anyway (putting aside performance). I was able to measure a small
> performance difference as well.

Thanks. I changed a couple of very trivial things before pushing.

+       BlockNumber (*cb) (ReadStream *pgsr, void *private_data,
+                          void *per_buffer_data);

This type has a friendly name: ReadStreamBlockNumberCB.

+       scan->rs_read_stream = read_stream_begin_relation(READ_STREAM_SEQUENTIAL,

I've been on the fence about that flag for sequential scan... Some days
I want to consider changing to READ_STREAM_DEFAULT and relying on our
"anti-heuristics" to suppress advice, which would work out the same in
most cases but might occasionally win big. It might also hurt, I dunno,
so I suspect we'd have to make it better first, which my patch in [1]
was a first swing at, but I haven't researched that enough. So, kept
this way!

- * Read the next block of the scan relation into a buffer and pin that buffer
- * before saving it in the scan descriptor.
+ * Read the next block of the scan relation from the read stream and pin that
+ * buffer before saving it in the scan descriptor.

Changed to:

 * Read the next block of the scan relation from the read stream and save it
 * in the scan descriptor. It is already pinned.

+static BlockNumber
+heap_scan_stream_read_next_parallel(ReadStream *pgsr, void *private_data,
+                                    void *per_buffer_data)

Changed argument names to match the function pointer type definition,
"stream" and "callback_private_data".

BTW looking at the branching in read-stream user patches that have an
initialisation step like yours, I wonder if it might ever make sense to
be able to change the callback on the fly from inside the callback, so
that you finish up with a branchless one doing most of the work. I have
no idea if it's worth it...

[1] https://www.postgresql.org/message-id/CA%2BhUKGLLFvou5rx5FDhm-Pc9r4STQTFFmrx6SUV%2Bvk8fwMbreA%40mail.gmail.com
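The "swap the callback from inside the callback" idea, in miniature. Nothing here is PostgreSQL code or from the patches; it just shows how a one-off initialisation branch can be paid for once by letting the first invocation replace the stored function pointer with a branch-free steady-state version:

    #include <stdio.h>

    struct stream
    {
        int         next_block;
        int         nblocks;
        int         (*next_cb) (struct stream *);
    };

    static int
    next_steady(struct stream *s)
    {
        if (s->next_block >= s->nblocks)
            return -1;              /* end of stream */
        return s->next_block++;
    }

    static int
    next_first_call(struct stream *s)
    {
        /* one-time setup that would otherwise need a branch on every call */
        s->next_block = 0;
        s->nblocks = 4;
        s->next_cb = next_steady;   /* swap ourselves out */
        return next_steady(s);
    }

    int
    main(void)
    {
        struct stream s = {.next_cb = next_first_call};
        int         block;

        while ((block = s.next_cb(&s)) >= 0)
            printf("block %d\n", block);
        return 0;
    }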
Hi, On 2024-04-08 09:36:59 +1200, Thomas Munro wrote: > I've been on the fence about that flag for sequential scan... Some > days I want to consider changing to READ_STREAM_DEFAULT and relying on > our "anti-heuristics" to suppress advice, which would work out the > same in most cases but might occasionally win big. Agreed, it's pretty easy to end up with a fairly "fragmented" set of a relation's buffers in s_b. OTOH, there might not be any need for the heuristic if we actually trigger reads asynchronously. > BTW looking at the branching in read-stream user patches that have an > initialisation step like yours, I wonder if it might every make sense > to be able to change the callback on the fly from inside the callback, > so that you finish up with a branchless one doing most of the work. I > have no idea if it's worth it... I was wondering about that too, I dislike those branches. But instead of changing the callback, it seems like a better design would be to have another dedicated callback for that? There already is a dedicated branch for the "just starting up" path in read_stream_next_buffer(), so it'd be pretty much free to call another callback there. Greetings, Andres Freund
Hello, I decided to compare v17 vs v16 performance (as I did the last year [1]) and discovered that v17 loses to v16 in the pg_tpcds (s64da_tpcds) benchmark, query15 (and several others, but I focused on this one): Best pg-src-master--.* worse than pg-src-16--.* by 52.2 percents (229.84 > 151.03): pg_tpcds.query15 Average pg-src-master--.* worse than pg-src-16--.* by 53.4 percents (234.20 > 152.64): pg_tpcds.query15 Please look at the full html report attached in case you're interested. (I used my pg-mark tool to measure/analyze performance, but I believe the same results can be seen without it.) `git bisect` for this performance degradation pointed at b7b0f3f27... [1] https://www.postgresql.org/message-id/b32bed1b-0746-9b20-1472-4bdc9ca66d52%40gmail.com Best regards, Alexander
Attachment
On Sat, May 18, 2024 at 1:00 AM Alexander Lakhin <exclusion@gmail.com> wrote: > I decided to compare v17 vs v16 performance (as I did the last year [1]) > and discovered that v17 loses to v16 in the pg_tpcds (s64da_tpcds) > benchmark, query15 (and several others, but I focused on this one): > Best pg-src-master--.* worse than pg-src-16--.* by 52.2 percents (229.84 > 151.03): pg_tpcds.query15 > Average pg-src-master--.* worse than pg-src-16--.* by 53.4 percents (234.20 > 152.64): pg_tpcds.query15 > Please look at the full html report attached in case you're interested. > > (I used my pg-mark tool to measure/analyze performance, but I believe the > same results can be seen without it.) Will investigate, but if it's easy for you to rerun, does it help if you increase Linux readahead, eg blockdev --setra setting?
On Sat, May 18, 2024 at 8:09 AM Thomas Munro <thomas.munro@gmail.com> wrote: > On Sat, May 18, 2024 at 1:00 AM Alexander Lakhin <exclusion@gmail.com> wrote: > > I decided to compare v17 vs v16 performance (as I did the last year [1]) > > and discovered that v17 loses to v16 in the pg_tpcds (s64da_tpcds) > > benchmark, query15 (and several others, but I focused on this one): > > Best pg-src-master--.* worse than pg-src-16--.* by 52.2 percents (229.84 > 151.03): pg_tpcds.query15 > > Average pg-src-master--.* worse than pg-src-16--.* by 53.4 percents (234.20 > 152.64): pg_tpcds.query15 > > Please look at the full html report attached in case you're interested. > > > > (I used my pg-mark tool to measure/analyze performance, but I believe the > > same results can be seen without it.) > > Will investigate, but if it's easy for you to rerun, does it help if > you increase Linux readahead, eg blockdev --setra setting? Andres happened to have TPC-DS handy, and reproduced that regression in q15. We tried some stuff and figured out that it requires parallel_leader_participation=on, ie that this looks like some kind of parallel fairness and/or timing problem. It seems to be a question of which worker finishes up processing matching rows, and the leader gets a ~10ms head start but may be a little more greedy with the new streaming code. He tried reordering the table contents and then saw 17 beat 16. So for q15, initial indications are that this isn't a fundamental regression, it's just a test that is sensitive to some arbitrary conditions. I'll try to figure out some more details about that, ie is it being too greedy on small-ish tables, and generally I do wonder about the interactions between the heuristics and batching working at different levels (OS, seq scan, read stream, hence my earlier ra question which is likely a red herring) and how there might be unintended consequences/interference patterns, but this particular case seems more data dependent.
On Sat, May 18, 2024 at 11:30 AM Thomas Munro <thomas.munro@gmail.com> wrote:
> Andres happened to have TPC-DS handy, and reproduced that regression
> in q15. We tried some stuff and figured out that it requires
> parallel_leader_participation=on, ie that this looks like some kind of
> parallel fairness and/or timing problem. It seems to be a question of
> which worker finishes up processing matching rows, and the leader gets
> a ~10ms head start but may be a little more greedy with the new
> streaming code. He tried reordering the table contents and then saw
> 17 beat 16. So for q15, initial indications are that this isn't a
> fundamental regression, it's just a test that is sensitive to some
> arbitrary conditions.
>
> I'll try to figure out some more details about that, ie is it being
> too greedy on small-ish tables,

After more debugging, we learned a lot more things...

1. That query produces spectacularly bad estimates, so we finish up
having to increase the number of buckets in a parallel hash join many
times. That is quite interesting, but unrelated to new code.

2. Parallel hash join is quite slow at negotiating an increase in the
number of hash buckets, if all of the input tuples are being filtered
out by quals, because of the choice of where workers check for
PHJ_GROWTH_NEED_MORE_BUCKETS. That could be improved quite easily I
think. I have put that on my todo list 'cause that's also my code, but
it's not a new issue, it's just one that is now highlighted...

3. This bit of read_stream.c is exacerbating unfairness in the
underlying scan, so that 1 and 2 come together and produce a nasty
slowdown, which goes away if you change it like so:

-       BlockNumber     blocknums[16];
+       BlockNumber     blocknums[1];

I will follow up after some more study.
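A toy model of point 3, under the simplifying assumptions that exactly two workers take strict turns and that each turn fills a whole look-ahead batch from the shared block allocator (the real behaviour depends on scheduling, but the tail-end effect is the same shape):

    #include <stdio.h>

    int
    main(void)
    {
        int     remaining = 20;     /* blocks left near the end of the scan */
        int     batch = 16;         /* blocknums[16]-style look-ahead batch */
        int     got[2] = {0, 0};
        int     who = 0;

        while (remaining > 0)
        {
            /* worker "who" fills its look-ahead batch in one tight loop */
            for (int i = 0; i < batch && remaining > 0; i++, remaining--)
                got[who]++;
            who = !who;             /* then the other worker gets a turn */
        }

        /* with batch = 16 the split is 16/4; with batch = 1 it is 10/10 */
        printf("worker 0: %d blocks, worker 1: %d blocks\n", got[0], got[1]);
        return 0;
    }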
Hello Thomas, 18.05.2024 07:47, Thomas Munro wrote: > After more debugging, we learned a lot more things... > > 1. That query produces spectacularly bad estimates, so we finish up > having to increase the number of buckets in a parallel hash join many > times. That is quite interesting, but unrelated to new code. > 2. Parallel hash join is quite slow at negotiating an increase in the > number of hash bucket, if all of the input tuples are being filtered > out by quals, because of the choice of where workers check for > PHJ_GROWTH_NEED_MORE_BUCKETS. That could be improved quite easily I > think. I have put that on my todo list 'cause that's also my code, > but it's not a new issue it's just one that is now highlighted... > 3. This bit of read_stream.c is exacerbating unfairness in the > underlying scan, so that 1 and 2 come together and produce a nasty > slowdown, which goes away if you change it like so: > > - BlockNumber blocknums[16]; > + BlockNumber blocknums[1]; > > I will follow up after some more study. Thank you for the information! Unfortunately, I can't see significant differences in my environment with parallel_leader_participation=off. With blocknums[1], timing is changed, but the effect is not persistent. 10 query15 executions in a row, b7b0f3f27: 277.932 ms 281.805 ms 278.335 ms 281.565 ms 284.167 ms 283.171 ms 281.165 ms 281.615 ms 285.394 ms 277.301 ms b7b0f3f27~1: 159.789 ms 165.407 ms 160.893 ms 159.343 ms 160.936 ms 161.577 ms 161.637 ms 163.421 ms 163.143 ms 167.109 ms b7b0f3f27 + blocknums[1]: 164.133 ms 280.920 ms 160.748 ms 163.182 ms 161.709 ms 161.998 ms 161.239 ms 276.256 ms 161.601 ms 160.384 ms I placed PGDATA on tmpfs to rule out any blockdev specifics (increasing blockdev ra from 256 to 4096 didn't help me with PGDATA on NVME either.) Best regards, Alexander
On Sun, May 19, 2024 at 7:00 AM Alexander Lakhin <exclusion@gmail.com> wrote: > With blocknums[1], timing is changed, but the effect is not persistent. > 10 query15 executions in a row, b7b0f3f27: > 277.932 ms > 281.805 ms > 278.335 ms > 281.565 ms > 284.167 ms > 283.171 ms > 281.165 ms > 281.615 ms > 285.394 ms > 277.301 ms The bad time 10/10. > b7b0f3f27~1: > 159.789 ms > 165.407 ms > 160.893 ms > 159.343 ms > 160.936 ms > 161.577 ms > 161.637 ms > 163.421 ms > 163.143 ms > 167.109 ms The good time 10/10. > b7b0f3f27 + blocknums[1]: > 164.133 ms > 280.920 ms > 160.748 ms > 163.182 ms > 161.709 ms > 161.998 ms > 161.239 ms > 276.256 ms > 161.601 ms > 160.384 ms The good time 8/10, the bad time 2/10. Thanks for checking! I bet all branches can show that flip/flop instability in these adverse conditions, depending on random scheduling details. I will start a new thread with a patch for the root cause of that, ie problem #2 (this will need back-patching), and post a fix for #3 (v17 blocknums[N] tweak affecting fairness/likelihood, which was probably basically a bit of ill-advised premature optimisation) here in a few days.
Thank you to all of you for looking into this. On Sat, May 18, 2024 at 12:47 AM Thomas Munro <thomas.munro@gmail.com> wrote: > > On Sat, May 18, 2024 at 11:30 AM Thomas Munro <thomas.munro@gmail.com> wrote: > > Andres happened to have TPC-DS handy, and reproduced that regression > > in q15. We tried some stuff and figured out that it requires > > parallel_leader_participation=on, ie that this looks like some kind of > > parallel fairness and/or timing problem. It seems to be a question of > > which worker finishes up processing matching rows, and the leader gets > > a ~10ms head start but may be a little more greedy with the new > > streaming code. He tried reordering the table contents and then saw > > 17 beat 16. So for q15, initial indications are that this isn't a > > fundamental regression, it's just a test that is sensitive to some > > arbitrary conditions. > > > > I'll try to figure out some more details about that, ie is it being > > too greedy on small-ish tables, > > After more debugging, we learned a lot more things... > > 1. That query produces spectacularly bad estimates, so we finish up > having to increase the number of buckets in a parallel hash join many > times. That is quite interesting, but unrelated to new code. > 2. Parallel hash join is quite slow at negotiating an increase in the > number of hash bucket, if all of the input tuples are being filtered > out by quals, because of the choice of where workers check for > PHJ_GROWTH_NEED_MORE_BUCKETS. That could be improved quite easily I > think. I have put that on my todo list 'cause that's also my code, > but it's not a new issue it's just one that is now highlighted... > 3. This bit of read_stream.c is exacerbating unfairness in the > underlying scan, so that 1 and 2 come together and produce a nasty > slowdown, which goes away if you change it like so: > > - BlockNumber blocknums[16]; > + BlockNumber blocknums[1]; > > I will follow up after some more study. So, if you are seeing the slow-down mostly go away by reducing blocknums array size, does the regression only appear when the scan data is fully in shared buffers? Or is this blocknums other use (dealing with short reads)? Is your theory that one worker ends up reading 16 blocks that should have been distributed across multiple workers? - Melanie
On Tue, May 21, 2024 at 9:11 AM Melanie Plageman <melanieplageman@gmail.com> wrote: > So, if you are seeing the slow-down mostly go away by reducing > blocknums array size, does the regression only appear when the scan > data is fully in shared buffers? Or is this blocknums other use > (dealing with short reads)? That must be true (that blocknums array is normally only "filled" in the "fast path", where all buffers are found in cache). > Is your theory that one worker ends up reading 16 blocks that should > have been distributed across multiple workers? Yes, it just jiggles the odds around a bit, introducing a bit of extra unfairness by calling the callback in a tighter loop to build a little batch, revealing a pre-existing problem. The mistake in PHJ (problem #2 above) is that, once a worker decides it would like all workers to stop inserting so it can increase the number of buckets, it sets a flag to ask them to do that, and waits for them to see it, but if there is a worker filtering all tuples out, it never checks the "growth" flag. So it scans all the way to the end while the other guy waits. Normally it checks that flag when it is time to allocate a new chunk of memory, which seemed to make sense to me at the time: if we've hit the needs-more-buckets (or needs-more-batches) logic, then surely workers are inserting tuples and will soon allocate a new chunk! But, of course, here is the edge case where that isn't true: we had bad estimates so hash table too small (problem #1), we got lots of tuples clustered over a few heap pages and decided to expand the hash table, but right at that moment, matching tuples ran out so somebody had to finish the whole scan without ever checking the flag (problem #2), and that someone happened to have all the rest of the pages because we made the lookahead a bit less fair (problem #3). Nice confluence of problems. I expect #2 and #3 to be easy to fix, and I didn't look at the estimation problem #1 at all (perhaps a stats puzzle designed by the TPC to trip us up?).
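Reduced to a toy, the hang in problem #2 looks like this. This is only the shape of the issue, not the parallel hash join code, and the second function is just one conceivable way to notice the flag earlier:

    #include <stdbool.h>
    #include <stdio.h>

    static bool growth_requested = true;    /* another worker asked everyone to pause */

    /* current behaviour in miniature: the flag is only seen when a tuple is kept */
    static void
    scan_check_on_insert_only(int ntuples)
    {
        for (int i = 0; i < ntuples; i++)
        {
            bool    keep = (i == -1);   /* stand-in for a qual rejecting every tuple */

            if (keep)
            {
                /* only here would we allocate a chunk and notice the flag */
            }
        }
        printf("scanned %d tuples without ever checking the flag\n", ntuples);
    }

    /* one conceivable fix: also check the flag periodically while scanning */
    static void
    scan_check_periodically(int ntuples)
    {
        for (int i = 0; i < ntuples; i++)
        {
            if (growth_requested)
            {
                printf("paused after %d tuples to help resize\n", i);
                return;
            }
        }
    }

    int
    main(void)
    {
        scan_check_on_insert_only(100000);
        scan_check_periodically(100000);
        return 0;
    }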
Hello Thomas,

27.08.2024 09:52, Thomas Munro wrote:
> Here's a really simple way to see the new unfairness at the end of a
> parallel scan:
>
> drop table if exists t;
> create table t (i int);
> insert into t select generate_series(1, 100000);
> alter table t set (parallel_workers = 2);
> set parallel_setup_cost = 0;
> set parallel_leader_participation = off;
> explain (analyze, buffers, verbose) select count(*) from t;
>
> On my machine, unpatched master shows:
>
> Worker 0: actual time=0.036..12.452 rows=51076 loops=1
>   Buffers: shared hit=226
> Worker 1: actual time=0.037..12.003 rows=48924 loops=1
>   Buffers: shared hit=217
>
> The attached patch, which I'd like to push, is effectively what
> Alexander tested (blocknums[16] -> blocknums[1]). There's no point in
> using an array of size 1, so I've turned it into a simple variable and
> deleted the relevant comments. My machine shows:
>
> Worker 0: actual time=0.038..12.115 rows=49946 loops=1
>   Buffers: shared hit=221
> Worker 1: actual time=0.038..12.109 rows=50054 loops=1
>   Buffers: shared hit=222
>
> That difference may not seem huge, but other pre-existing things are
> going pathologically wrong in the reported query that magnify it (see
> my earlier analysis). It's an interesting problem that will require
> more study (my earlier analysis missed a detail that I'll write about
> separately), but it doesn't seem to be new or have easy fixes, so that
> will have to be for later work.

I've tried your query and could not get stable results, unfortunately.
The following script:

rm -rf "$PGDATA"; initdb -D "$PGDATA" >initdb.log 2>&1
pg_ctl -s -l server.log start
cat << EOF | psql | grep 'Parallel Seq Scan' -A10 | grep 'Worker ' -A1
create table t (i int);
insert into t select generate_series(1, 100000);
alter table t set (parallel_workers = 2);
set parallel_setup_cost = 0;
set parallel_leader_participation = off;
explain (analyze, buffers, verbose) select count(*) from t;
EOF
pg_ctl -s stop

gives me unstable numbers on unpatched master:

Worker 0: actual time=0.024..5.814 rows=51076 loops=1
  Buffers: shared hit=226
Worker 1: actual time=0.023..5.614 rows=48924 loops=1
  Buffers: shared hit=217
---
Worker 0: actual time=0.027..5.130 rows=36612 loops=1
  Buffers: shared hit=162
Worker 1: actual time=0.013..5.605 rows=63388 loops=1
  Buffers: shared hit=281
---
Worker 0: actual time=0.025..5.447 rows=47460 loops=1
  Buffers: shared hit=210
Worker 1: actual time=0.019..5.688 rows=52540 loops=1
  Buffers: shared hit=233

and also with the patch applied:

Worker 0: actual time=0.012..4.486 rows=55478 loops=1
  Buffers: shared hit=246
Worker 1: actual time=0.014..4.430 rows=44522 loops=1
  Buffers: shared hit=197
---
Worker 0: actual time=0.013..4.269 rows=55822 loops=1
  Buffers: shared hit=247
Worker 1: actual time=0.017..4.238 rows=44178 loops=1
  Buffers: shared hit=196
---
Worker 0: actual time=0.014..4.974 rows=50624 loops=1
  Buffers: shared hit=224
Worker 1: actual time=0.016..4.932 rows=49376 loops=1
  Buffers: shared hit=219
---
Worker 0: actual time=0.012..5.459 rows=65648 loops=1
  Buffers: shared hit=291
Worker 1: actual time=0.022..5.109 rows=34352 loops=1
  Buffers: shared hit=152

Please correct me, if I'm doing something wrong.

Best regards,
Alexander
On Wed, Aug 28, 2024 at 1:00 AM Alexander Lakhin <exclusion@gmail.com> wrote: > gives me unstable numbers on unpatched master: > Buffers: shared hit=226 > Buffers: shared hit=217 > Buffers: shared hit=162 > Buffers: shared hit=281 > Buffers: shared hit=210 > Buffers: shared hit=233 > and also with the patch applied: > Buffers: shared hit=246 > Buffers: shared hit=197 > Buffers: shared hit=247 > Buffers: shared hit=196 > Buffers: shared hit=224 > Buffers: shared hit=219 > Buffers: shared hit=291 > Buffers: shared hit=152 > > Please correct me, if I'm doing something wrong. Huh. I can reproduce what I showed with pretty low variance, on my FreeBSD, Linux and macOS systems. I included parallel_leader_participation=off so that the workers would hopefully start as closely together in time as possible, and hopefully allow only a block or so of variation in the outcome. If I do: create or replace function f(i int) returns int language plpgsql parallel safe as $$ begin raise notice '% pid %', clock_timestamp(), pg_backend_pid(); return i; end; $$; then postgres=# explain (analyze) select f(i) from t limit 1; NOTICE: 2024-08-28 16:41:32.845747+12 pid 47019 NOTICE: 2024-08-28 16:41:32.845746+12 pid 47018 shows start times differ by only a few microseconds at most, often 0. I wonder if your system is more variable at beginning execution? Maybe you're on a multi-socket system and forking/startup times vary in some way I can't see on small systems or something like that?
28.08.2024 07:45, Thomas Munro wrote: > Huh. I can reproduce what I showed with pretty low variance, on my > FreeBSD, Linux and macOS systems. I included > parallel_leader_participation=off so that the workers would hopefully > start as closely together in time as possible, and hopefully allow > only a block or so of variation in the outcome. If I do: > > create or replace function f(i int) returns int language plpgsql > parallel safe as $$ > begin > raise notice '% pid %', clock_timestamp(), pg_backend_pid(); > return i; > end; > $$; > > then > > postgres=# explain (analyze) select f(i) from t limit 1; > NOTICE: 2024-08-28 16:41:32.845747+12 pid 47019 > NOTICE: 2024-08-28 16:41:32.845746+12 pid 47018 > > shows start times differ by only a few microseconds at most, often 0. > I wonder if your system is more variable at beginning execution? > Maybe you're on a multi-socket system and forking/startup times vary > in some way I can't see on small systems or something like that? I use a single-socket system with AMD Ryzen 5900X, running Ubuntu 20.04. With no active background tasks running, I'm seeing: NOTICE: 2024-08-28 08:17:36.917162+00 pid 320103 NOTICE: 2024-08-28 08:17:36.917163+00 pid 320102 NOTICE: 2024-08-28 08:17:40.592333+00 pid 320143 NOTICE: 2024-08-28 08:17:40.592645+00 pid 320144 With log_min_messages = DEBUG3, I get: NOTICE: 2024-08-28 08:41:59.309364+00 pid 338263 NOTICE: 2024-08-28 08:41:59.310079+00 pid 338264 And the following messages in the server.log: 2024-08-28 08:41:59.304 UTC [338251] DEBUG: starting background worker process "parallel worker for PID 338262" 2024-08-28 08:41:59.304 UTC [338251] DEBUG: starting background worker process "parallel worker for PID 338262" 2024-08-28 08:41:59.305 UTC [338263] DEBUG: InitPostgres 2024-08-28 08:41:59.305 UTC [338264] DEBUG: InitPostgres 2024-08-28 08:41:59.309 UTC [338263] NOTICE: 2024-08-28 08:41:59.309364+00 pid 338263 2024-08-28 08:41:59.309 UTC [338263] CONTEXT: PL/pgSQL function f(integer) line 3 at RAISE 2024-08-28 08:41:59.309 UTC [338262] NOTICE: 2024-08-28 08:41:59.309364+00 pid 338263 2024-08-28 08:41:59.309 UTC [338262] CONTEXT: PL/pgSQL function f(integer) line 3 at RAISE parallel worker 2024-08-28 08:41:59.309 UTC [338263] DEBUG: shmem_exit(0): 5 before_shmem_exit callbacks to make 2024-08-28 08:41:59.309 UTC [338263] DEBUG: shmem_exit(0): 6 on_shmem_exit callbacks to make 2024-08-28 08:41:59.309 UTC [338263] DEBUG: proc_exit(0): 1 callbacks to make 2024-08-28 08:41:59.309 UTC [338263] DEBUG: exit(0) 2024-08-28 08:41:59.309 UTC [338263] DEBUG: shmem_exit(-1): 0 before_shmem_exit callbacks to make 2024-08-28 08:41:59.309 UTC [338263] DEBUG: shmem_exit(-1): 0 on_shmem_exit callbacks to make 2024-08-28 08:41:59.309 UTC [338263] DEBUG: proc_exit(-1): 0 callbacks to make 2024-08-28 08:41:59.310 UTC [338264] NOTICE: 2024-08-28 08:41:59.310079+00 pid 338264 2024-08-28 08:41:59.310 UTC [338264] CONTEXT: PL/pgSQL function f(integer) line 3 at RAISE It looks like the two parallel workers were started simultaneously, but then the second one lagged behind... Best regards, Alexander
On Wed, Aug 28, 2024 at 9:00 PM Alexander Lakhin <exclusion@gmail.com> wrote: > 2024-08-28 08:41:59.304 UTC [338251] DEBUG: starting background worker process "parallel worker for PID 338262" > 2024-08-28 08:41:59.304 UTC [338251] DEBUG: starting background worker process "parallel worker for PID 338262" > 2024-08-28 08:41:59.305 UTC [338263] DEBUG: InitPostgres > 2024-08-28 08:41:59.305 UTC [338264] DEBUG: InitPostgres > 2024-08-28 08:41:59.309 UTC [338263] NOTICE: 2024-08-28 08:41:59.309364+00 pid 338263 > 2024-08-28 08:41:59.310 UTC [338264] NOTICE: 2024-08-28 08:41:59.310079+00 pid 338264 > It looks like the two parallel workers were started simultaneously, but > then the second one lagged behind... Yeah. That's quite interesting, and must destabilise that simple-minded demo. I'm curious to know exactly what contention is causing that (about 3/4 of a millisecond that I don't see and now I want to know what it's waiting for), but it's a very crude test lacking timer resolution in the earlier messages, and it's an unrelated topic and a distraction. Perhaps it explains why you saw two different behaviours in Q15 with the patch and I didn't, though. Really it shouldn't be so sensitive to such variations, it's obviously a terrible plan, and TPC-DS needs a planner hacker mega-brain applied to it; I'm going to try to nerd-snipe one...
Hello Thomas, 29.08.2024 01:16, Thomas Munro wrote: > Yeah. That's quite interesting, and must destabilise that > simple-minded demo. I'm curious to know exactly what contention is > causing that (about 3/4 of a millisecond that I don't see and now I > want to know what it's waiting for), but it's a very crude test > lacking timer resolution in the earlier messages, and it's an > unrelated topic and a distraction. Perhaps it explains why you saw > two different behaviours in Q15 with the patch and I didn't, though. > Really it shouldn't be so sensitive to such variations, it's obviously > a terrible plan, and TPC-DS needs a planner hacker mega-brain applied > to it; I'm going to try to nerd-snipe one... I looked at two perf profiles of such out-of-sync processes and found no extra calls or whatsoever in the slow one, it just has the number of perf samples increased proportionally. It made me suspect CPU frequency scaling... Indeed, with the "performance" governor set and the boost mode disabled, I'm now seeing much more stable numbers (I do this tuning before running performance tests, but I had forgotten about that when I ran that your test, my bad). I'm sorry for the noise and the distraction. Still, now I can confirm your results. Without the patch, two parallel workers gave "Buffers: shared hit=217 / Buffers: shared hit=226" 10 times out of 10. And with the patch, I've got "shared hit=219 / shared hit=224" 3 times, "shared hit=220 / shared hit=223" 4 times, "shared hit=221 / shared hit=222" 3 times of 10. On b7b0f3f27~1, my results are: "shared hit=219 / shared hit=224": 2 "shared hit=220 / shared hit=223": 3 "shared hit=221 / shared hit=222": 4 "shared hit=218 / shared hit=225": 1 Best regards, Alexander
On Fri, Aug 30, 2024 at 1:00 AM Alexander Lakhin <exclusion@gmail.com> wrote: > I looked at two perf profiles of such out-of-sync processes and found no > extra calls or whatsoever in the slow one, it just has the number of perf > samples increased proportionally. It made me suspect CPU frequency > scaling... Indeed, with the "performance" governor set and the boost mode > disabled, I'm now seeing much more stable numbers (I do this tuning before > running performance tests, but I had forgotten about that when I ran that > your test, my bad). Aha, mystery solved. I have pushed the fix. Thanks!