Thread: refactoring relation extension and BufferAlloc(), faster COPY
Hi,

I'm working to extract independently useful bits from my AIO work, to reduce the size of that patchset. This is one of those pieces.

In workloads that extend relations a lot, we end up being extremely contended on the relation extension lock. We've attempted to address that to some degree by using batching, which helps, but only so much.

The fundamental issue, in my opinion, is that we do *way* too much while holding the relation extension lock. We acquire a victim buffer, if that buffer is dirty, we potentially flush the WAL, then write out that buffer. Then we zero out the buffer contents. Call smgrextend().

Most of that work does not actually need to happen while holding the relation extension lock. As far as I can tell, the minimum that needs to be covered by the extension lock is the following:

1) call smgrnblocks()
2) insert buffer[s] into the buffer mapping table at the location returned by smgrnblocks()
3) mark buffer[s] as IO_IN_PROGRESS

1) obviously has to happen with the relation extension lock held, because otherwise we might miss another relation extension. 2+3) need to happen with the lock held, because otherwise another backend not doing an extension could read the block before we're done extending, dirty it, write it out, and then have it overwritten by the extending backend.

The reason we currently do so much work while holding the relation extension lock is that bufmgr.c does not know about the relation extension lock, and that relation extension happens entirely within ReadBuffer* - there's no way to use a narrower scope for the lock.

My fix for that is to add a dedicated function for extending relations, which can acquire the extension lock if necessary (callers can tell it to skip that, e.g., when initially creating an init fork). This routine is called by ReadBuffer_common() when P_NEW is passed in, to provide backward compatibility.
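The narrowed critical section described above could look roughly like this (pseudocode; the helper names other than smgrnblocks()/smgrzeroextend()/LockRelationForExtension() are illustrative, not the patch's actual API):

```
/* Outside the extension lock: acquire and clean victim buffer(s);
 * they stay pinned, so other backends' clock sweeps skip them. */
victim_buffers = GetVictimBuffers(nblocks);

LockRelationForExtension(rel, ExclusiveLock);
first_block = smgrnblocks(smgr, forknum);             /* 1) */
InsertIntoBufferMapping(victim_buffers, first_block); /* 2) */
MarkIoInProgress(victim_buffers);                     /* 3) */
smgrzeroextend(smgr, forknum, first_block, nblocks);  /* extend on disk */
UnlockRelationForExtension(rel, ExclusiveLock);

/* Outside the lock again: lock buffer contents as requested,
 * then TerminateBufferIO() on each buffer. */
```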
To be able to acquire victim buffers outside of the extension lock, victim buffers are now acquired separately from inserting the new buffer mapping entry. Victim buffers are pinned, cleaned, removed from the buffer mapping table, and marked invalid. Because they are pinned, clock sweeps in other backends won't return them. This is done in a new function, [Local]BufferAlloc().

This is similar to Yuri's patch at [0], but not that similar to earlier or later approaches in that thread. I don't really understand why that thread went on to ever more complicated approaches, when the basic approach shows plenty of gains, with no issues around the number of buffer mapping entries that can exist, etc.

Other interesting bits I found:

a) For workloads that [mostly] fit into s_b, the smgrwrite() that BufferAlloc() does nearly doubles the amount of writes: first the kernel ends up writing out all the zeroed-out buffers after a while, then again when we write out the actual buffer contents.

The best fix for that seems to be to optionally use posix_fallocate() to reserve space, without dirtying pages in the kernel page cache. However, that appears to be beneficial only when extending by multiple pages at once, because it ends up causing one filesystem-journal entry for each extension on at least some filesystems.

I added smgrzeroextend(), which can extend by multiple blocks without the caller providing a buffer to write out. When extending by 8 or more blocks, posix_fallocate() is used if available; otherwise pg_pwritev_with_retry() is used to extend the file.

b) I found that it is quite beneficial to bulk-extend the relation with smgrextend() even without concurrency. The reason for that is primarily the aforementioned dirty buffers that our current extension method causes.

One bit that stumped me for quite a while is knowing how much to extend the relation by.
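To illustrate point a): the following standalone sketch (not PostgreSQL code; file name and block count are arbitrary) shows how posix_fallocate() extends a file by several 8kB blocks with zeroed, space-reserved contents, without writing zero-filled pages through the page cache:

```c
/* Hypothetical standalone demo: extend a file by several 8kB blocks with
 * posix_fallocate(), reserving zeroed space without dirtying kernel
 * page-cache pages the way pwrite()ing zeroed buffers would. */
#include <fcntl.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <unistd.h>

#define BLCKSZ 8192

/* Returns the new size of the file in blocks, or -1 on error. */
long
zeroextend(const char *path, int nblocks)
{
	int			fd = open(path, O_RDWR | O_CREAT, 0600);
	struct stat st;

	if (fd < 0)
		return -1;
	/* posix_fallocate() returns 0 on success, an errno value otherwise */
	if (fstat(fd, &st) != 0 ||
		posix_fallocate(fd, st.st_size, (off_t) nblocks * BLCKSZ) != 0)
	{
		close(fd);
		return -1;
	}
	fstat(fd, &st);
	close(fd);
	return (long) (st.st_size / BLCKSZ);
}
```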
RelationGetBufferForTuple() drives the decision whether / how much to bulk extend purely by the contention on the extension lock, which obviously does not work for non-concurrent workloads. After quite a while I figured out that we actually have good information on how much to extend by, at least for COPY / heap_multi_insert(): heap_multi_insert() can compute how much space is needed to store all tuples, and pass that on to RelationGetBufferForTuple(). For that to be accurate we need to recompute that number whenever we use an already partially filled page. That's not great, but doesn't appear to be a measurable overhead.

c) Contention on the FSM, and on the pages returned by it, is a serious bottleneck after a) and b). The biggest issue is that the current bulk insertion logic in hio.c enters all but one of the new pages into the freespacemap. That will immediately cause all the other backends to contend on the first few pages returned by the FSM, and cause contention on the FSM pages themselves.

I've partially addressed that by using the information about the required number of pages from b). Whether we bulk insert or not, the pages we know we're going to need for one heap_multi_insert() don't need to be added to the FSM - we're going to use them anyway. I've stashed the number of free blocks in the BulkInsertState for now, but I'm not convinced that that's the right place. If I revert just this part, the "concurrent COPY into unlogged table" benchmark goes from ~240 tps to ~190 tps.

Even after that change the FSM is a major bottleneck. Below I included benchmarks showing this by just removing the use of the FSM, but I haven't done anything further about it. The contention seems to come both from updating the FSM and from thundering-herd-like symptoms when accessing the FSM. The update part could likely be addressed to some degree with a batch update operation updating the state for multiple pages.
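The space computation from b) can be sketched as follows (an illustrative toy, not the actual heap_multi_insert() / RelationGetBufferForTuple() code; PAGE_USABLE is an assumed per-page capacity, and oversized tuples are not handled):

```c
/* Toy sketch: given the sizes of the tuples in one batch, estimate how
 * many empty pages are needed to hold them, filling each page greedily
 * in order.  PAGE_USABLE stands in for the space left on an 8kB page
 * after the page header and line pointers; the real computation would
 * also account for fillfactor, alignment, etc. */
#define PAGE_USABLE 8000

int
pages_needed(const int *tuple_sizes, int ntuples)
{
	int			pages = 0;
	int			free_space = 0;

	for (int i = 0; i < ntuples; i++)
	{
		if (tuple_sizes[i] > free_space)
		{
			/* start a fresh page */
			pages++;
			free_space = PAGE_USABLE;
		}
		free_space -= tuple_sizes[i];
	}
	return pages;
}
```

Recomputing this whenever an already partially filled page is used just means starting with its actual free space instead of a fresh page.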
The access part could perhaps be addressed by adding an operation that gets a page and immediately marks it as fully used, so other backends won't also try to access it.

d) Doing

/* new buffers are zero-filled */
MemSet((char *) bufBlock, 0, BLCKSZ);

under the extension lock is surprisingly expensive on my two-socket workstation (but much less noticeable on my laptop). If I move the MemSet back under the extension lock, the "concurrent COPY into unlogged table" benchmark goes from ~240 tps to ~200 tps.

e) When running a few benchmarks for this email, I noticed that there was a sharp performance dropoff for the patched code for a pgbench -S -s100 on a database with 1GB s_b, starting between 512 and 1024 clients. This started with the patch only acquiring one buffer partition lock at a time. Lots of debugging ensued, resulting in [3]. The problem isn't actually related to the change, it just makes it more visible, because the "lock chains" between two partitions reduce the average length of the wait queues substantially, by distributing them across more partitions. [3] has a reproducer that's entirely independent of this patchset.

Bulk extension acquires a number of victim buffers, acquires the extension lock, inserts the buffers into the buffer mapping table and marks them as io-in-progress, calls smgrextend() and releases the extension lock. After that the buffer[s] are locked (depending on mode and an argument indicating the number of blocks to be locked), and TerminateBufferIO() is called.

This requires two new pieces of infrastructure:

First, pinning multiple buffers opens up the obvious danger that we might run out of non-pinned buffers. I added LimitAdditional[Local]Pins(), which allows each backend to pin a proportional share of buffers (although always allowing one, as we do today).

Second, having multiple IOs in progress at the same time isn't possible with the InProgressBuf mechanism. I added ResourceOwnerRememberBufferIO() etc. to deal with that instead.
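The proportional pin limit could look roughly like this (an illustrative sketch, not the patch's actual LimitAdditional[Local]Pins() implementation):

```c
/* Sketch: cap how many additional buffers one backend may pin at its
 * fair share of shared buffers, but always allow at least one, since a
 * single pin is always permitted today. */
int
limit_additional_pins(int desired, int shared_buffers, int max_backends)
{
	int			share = shared_buffers / max_backends;

	if (share < 1)
		share = 1;				/* never less than today's guarantee */
	return desired < share ? desired : share;
}
```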
I like that this ends up removing a lot of AbortBufferIO() calls from the loops of various aux processes (now released inside ReleaseAuxProcessResources()).

In very extreme workloads (single backend doing a pgbench -S -s 100 against a s_b=64MB database) the memory allocations triggered by StartBufferIO() are *just about* visible; not sure if that's worth worrying about - we do such allocations for the much more common pinning of buffers as well.

The new [Bulk]ExtendSharedRelationBuffered() functions currently have both a Relation and an SMgrRelation argument, requiring at least one of them to be set. The reason for that is that, on the one hand, LockRelationForExtension() requires a Relation, while on the other hand, redo routines typically don't have a Relation around (recovery doesn't require an extension lock). That's not pretty, but seems a tad better than the ReadBufferExtended() vs ReadBufferWithoutRelcache() mess.

I've done a fair bit of benchmarking of this patchset. For COPY it comes out ahead everywhere. It's possible that there's a very small regression for extremely IO-miss-heavy workloads, more below.

server "base" configuration:

max_wal_size=150GB
shared_buffers=24GB
huge_pages=on
autovacuum=0
backend_flush_after=2MB
max_connections=5000
wal_buffers=128MB
wal_segment_size=1GB

benchmark: pgbench running COPY into a single table. pgbench -t is set according to the client count, so that the same amount of data is inserted. This is done both using small files ([1], ringbuffer not effective, no dirty data to write out within the benchmark window) and somewhat larger files ([2], lots of data to write out due to ringbuffer).

To make it a fair comparison, HEAD includes the lwlock-waitqueue fix as well.
s_b=24GB

test: unlogged_small_files, format: text, files: 1024, 9015MB total

        seconds tbl-MBs seconds tbl-MBs seconds tbl-MBs
clients HEAD    HEAD    patch   patch   no_fsm  no_fsm
1       58.63   207     50.22   242     54.35   224
2       32.67   372     25.82   472     27.30   446
4       22.53   540     13.30   916     14.33   851
8       15.14   804     7.43    1640    7.48    1632
16      14.69   829     4.79    2544    4.50    2718
32      15.28   797     4.41    2763    3.32    3710
64      15.34   794     5.22    2334    3.06    4061
128     15.49   786     4.97    2452    3.13    3926
256     15.85   768     5.02    2427    3.26    3769
512     16.02   760     5.29    2303    3.54    3471

test: logged_small_files, format: text, files: 1024, 9018MB total

        seconds tbl-MBs seconds tbl-MBs seconds tbl-MBs
clients HEAD    HEAD    patch   patch   no_fsm  no_fsm
1       68.18   178     59.41   205     63.43   192
2       39.71   306     33.10   368     34.99   348
4       27.26   446     19.75   617     20.09   607
8       18.84   646     12.86   947     12.68   962
16      15.96   763     9.62    1266    8.51    1436
32      15.43   789     8.20    1486    7.77    1579
64      16.11   756     8.91    1367    8.90    1383
128     16.41   742     10.00   1218    9.74    1269
256     17.33   702     11.91   1023    10.89   1136
512     18.46   659     14.07   866     11.82   1049

test: unlogged_medium_files, format: text, files: 64, 9018MB total

        seconds tbl-MBs seconds tbl-MBs seconds tbl-MBs
clients HEAD    HEAD    patch   patch   no_fsm  no_fsm
1       63.27   192     56.14   217     59.25   205
2       40.17   303     29.88   407     31.50   386
4       27.57   442     16.16   754     17.18   709
8       21.26   573     11.89   1025    11.09   1099
16      21.25   573     10.68   1141    10.22   1192
32      21.00   580     10.72   1136    10.35   1178
64      20.64   590     10.15   1200    9.76    1249
128     skipped
256     skipped
512     skipped

test: logged_medium_files, format: text, files: 64, 9018MB total

        seconds tbl-MBs seconds tbl-MBs seconds tbl-MBs
clients HEAD    HEAD    patch   patch   no_fsm  no_fsm
1       71.89   169     65.57   217     69.09   69.09
2       47.36   257     36.22   407     38.71   38.71
4       33.10   368     21.76   754     22.78   22.78
8       26.62   457     15.89   1025    15.30   15.30
16      24.89   489     17.08   1141    15.20   15.20
32      25.15   484     17.41   1136    16.14   16.14
64      26.11   466     17.89   1200    16.76   16.76
128     skipped
256     skipped
512     skipped

Just to see how far it can be pushed, with binary format we can now get to nearly 6GB/s into a table when disabling the FSM -
note the 2x difference between patch and patch+no-fsm at 32 clients.

test: unlogged_small_files, format: binary, files: 1024, 9508MB total

        seconds tbl-MBs seconds tbl-MBs seconds tbl-MBs
clients HEAD    HEAD    patch   patch   no_fsm  no_fsm
1       34.14   357     28.04   434     29.46   413
2       22.67   537     14.42   845     14.75   826
4       16.63   732     7.62    1599    7.69    1587
8       13.48   904     4.36    2795    4.13    2959
16      14.37   848     3.78    3224    2.74    4493
32      14.79   823     4.20    2902    2.07    5974
64      14.76   825     5.03    2423    2.21    5561
128     14.95   815     4.36    2796    2.30    5343
256     15.18   802     4.31    2828    2.49    4935
512     15.41   790     4.59    2656    2.84    4327

s_b=4GB

test: unlogged_small_files, format: text, files: 1024, 9018MB total

        seconds tbl-MBs seconds tbl-MBs
clients HEAD    HEAD    patch   patch
1       62.55   194     54.22   224
2       37.11   328     28.94   421
4       25.97   469     16.42   742
8       20.01   609     11.92   1022
16      19.55   623     11.05   1102
32      19.34   630     11.27   1081
64      19.07   639     12.04   1012
128     19.22   634     12.27   993
256     19.34   630     12.28   992
512     19.60   621     11.74   1038

test: logged_small_files, format: text, files: 1024, 9018MB total

        seconds tbl-MBs seconds tbl-MBs
clients HEAD    HEAD    patch   patch
1       71.71   169     63.63   191
2       46.93   259     36.31   335
4       30.37   401     22.41   543
8       22.86   533     16.90   721
16      20.18   604     14.07   866
32      19.64   620     13.06   933
64      19.71   618     15.08   808
128     19.95   610     15.47   787
256     20.48   595     16.53   737
512     21.56   565     16.86   722

test: unlogged_medium_files, format: text, files: 64, 9018MB total

        seconds tbl-MBs seconds tbl-MBs
clients HEAD    HEAD    patch   patch
1       62.65   194     55.74   218
2       40.25   302     29.45   413
4       27.37   445     16.26   749
8       22.07   552     11.75   1037
16      21.29   572     10.64   1145
32      20.98   580     10.70   1139
64      20.65   590     10.21   1193
128     skipped
256     skipped
512     skipped

test: logged_medium_files, format: text, files: 64, 9018MB total

        seconds tbl-MBs seconds tbl-MBs
clients HEAD    HEAD    patch   patch
1       71.72   169     65.12   187
2       46.46   262     35.74   341
4       32.61   373     21.60   564
8       26.69   456     16.30   747
16      25.31   481     17.00   716
32      24.96   488     17.47   697
64      26.05   467     17.90   680
128     skipped
256     skipped
512     skipped

test: unlogged_small_files, format: binary, files: 1024, 9505MB
total

        seconds tbl-MBs seconds tbl-MBs
clients HEAD    HEAD    patch   patch
1       37.62   323     32.77   371
2       28.35   429     18.89   645
4       20.87   583     12.18   1000
8       19.37   629     10.38   1173
16      19.41   627     10.36   1176
32      18.62   654     11.04   1103
64      18.33   664     11.89   1024
128     18.41   661     11.91   1023
256     18.52   658     12.10   1007
512     18.78   648     11.49   1060

benchmark: Run a pgbench -S workload with scale 100, so it doesn't fit into s_b, thereby exercising BufferAlloc()'s buffer replacement path heavily. The run-to-run variance on my workstation is high for this workload (both before and after my changes). I also found that the ramp-up time at higher client counts is very significant:

progress: 2.1 s, 5816.8 tps, lat 1.835 ms stddev 4.450, 0 failed
progress: 3.0 s, 666729.4 tps, lat 5.755 ms stddev 16.753, 0 failed
progress: 4.0 s, 899260.1 tps, lat 3.619 ms stddev 41.108, 0 failed
...

One would need to run pgbench for impractically long to make that effect vanish. My not-so-great solution for this was to run with -T21 -P5 and use the best 5s as the tps.

s_b=1GB

        tps     tps
clients master  patched
1       49541   48805
2       85342   90010
4       167340  168918
8       308194  303222
16      524294  523678
32      649516  649100
64      932547  937702
128     908249  906281
256     856496  903979
512     764254  934702
1024    653886  925113
2048    569695  917262
4096    526782  903258

s_b=128MB

        tps     tps
clients master  patched
1       40407   39854
2       73180   72252
4       143334  140860
8       240982  245331
16      429265  420810
32      544593  540127
64      706408  726678
128     713142  718087
256     611030  695582
512     552751  686290
1024    508248  666370
2048    474108  656735
4096    448582  633040

As there might be a small regression at the smallest end, I ran a more extreme version of the above: a pipelined pgbench -S, with a single client, for longer, with s_b=8MB. To further reduce noise I pinned the server to one CPU and the client to another, and disabled turbo mode on the CPU.

master "total" tps: 61.52
master "best 5s" tps: 61.8
patch "total" tps: 61.20
patch "best 5s" tps: 61.4

Hardly conclusive, but it does look like there's a small effect.
It could be code layout or such. My guess however is that it's the resource owner for in-progress IO that I added - that adds an additional allocation inside the resowner machinery. I commented those out (that's obviously incorrect!) just to see whether that changes anything:

no-resowner "total" tps: 62.03
no-resowner "best 5s" tps: 62.2

So it looks like it is indeed the resowner. I am a bit surprised, because obviously we already use that mechanism for pins, which obviously are more frequent. I'm not sure it's worth worrying about - this is a pretty absurd workload. But if we decide it is, I can think of a few ways to address this. E.g.:

- We could preallocate an initial element inside the ResourceArray struct, so that a newly created resowner won't need to allocate immediately.
- We could only use resowners if there's more than one IO in progress at the same time - but I don't like that idea much.
- We could try to store the "in-progress"-ness of a buffer inside the 'bufferpin' resowner entry - on 64bit systems there's plenty of space for that. But on 32bit systems...

The patches here aren't fully polished (as will be evident). But they should be more than good enough to discuss whether this is a sane direction.

Greetings,

Andres Freund

[0] https://postgr.es/m/3b108afd19fa52ed20c464a69f64d545e4a14772.camel%40postgrespro.ru
[1] COPY (SELECT repeat(random()::text, 5) FROM generate_series(1, 100000)) TO '/tmp/copytest_data_text.copy' WITH (FORMAT text);
[2] COPY (SELECT repeat(random()::text, 5) FROM generate_series(1, 6*100000)) TO '/tmp/copytest_data_text.copy' WITH (FORMAT text);
[3] https://postgr.es/m/20221027165914.2hofzp4cvutj6gin@awork3.anarazel.de
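The first idea - a preallocated initial element - might look roughly like this (a toy sketch with int-valued resources, not the actual ResourceArray code):

```c
/* Toy sketch: keep one inline slot in the array struct, so a freshly
 * created resource owner can remember its first resource without
 * calling malloc().  Allocation only happens from the second item on. */
#include <stdlib.h>

typedef struct ResArray
{
	int			inline_item;	/* preallocated first slot */
	int			nitems;
	int			capacity;
	int		   *items;			/* allocated lazily */
} ResArray;

void
res_remember(ResArray *ra, int item)
{
	if (ra->nitems == 0)
	{
		ra->inline_item = item; /* common case: no allocation */
		ra->nitems = 1;
		return;
	}
	if (ra->items == NULL)
	{
		ra->capacity = 8;
		ra->items = malloc(ra->capacity * sizeof(int));
		ra->items[0] = ra->inline_item; /* migrate the inline item */
	}
	else if (ra->nitems == ra->capacity)
	{
		ra->capacity *= 2;
		ra->items = realloc(ra->items, ra->capacity * sizeof(int));
	}
	ra->items[ra->nitems++] = item;
}
```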
Attachment
- v1-0001-wip-lwlock-fix-quadratic-behaviour-with-very-long.patch
- v1-0002-aio-Add-some-error-checking-around-pinning.patch
- v1-0003-hio-Release-extension-lock-before-initializing-pa.patch
- v1-0004-Add-smgrzeroextend-FileZero-FileFallocate.patch
- v1-0005-bufmgr-Add-Pin-UnpinLocalBuffer.patch
- v1-0006-bufmgr-Acquire-and-clean-victim-buffer-separately.patch
- v1-0007-bufmgr-Support-multiple-in-progress-IOs-by-using-.patch
- v1-0008-bufmgr-Move-relation-extension-handling-into-Bulk.patch
- v1-0009-Convert-a-few-places-to-ExtendRelationBuffered.patch
- v1-0010-heapam-Add-num_pages-to-RelationGetBufferForTuple.patch
- v1-0011-hio-Use-BulkExtendRelationBuffered.patch
- v1-0012-bufmgr-debug-Add-PrintBuffer-Desc.patch
On Sat, Oct 29, 2022 at 8:24 AM Andres Freund <andres@anarazel.de> wrote:
>
> Hi,
>
> I'm working to extract independently useful bits from my AIO work, to reduce
> the size of that patchset. This is one of those pieces.

Thanks a lot for this great work. There are 12 patches in this thread; I believe each of these patches is trying to solve a separate problem and can be reviewed and committed separately, am I correct?

> In workloads that extend relations a lot, we end up being extremely contended
> on the relation extension lock. We've attempted to address that to some degree
> by using batching, which helps, but only so much.

Yes, I too have observed this in the past for the parallel inserts in CTAS work - https://www.postgresql.org/message-id/CALj2ACW9BUoFqWkmTSeHjFD-W7_00s3orqSvtvUk%2BKD2H7ZmRg%40mail.gmail.com. Tackling the bulk relation extension problems will unblock the parallel inserts (in CTAS, COPY) work, I believe.

> The fundamental issue, in my opinion, is that we do *way* too much while
> holding the relation extension lock. We acquire a victim buffer, if that
> buffer is dirty, we potentially flush the WAL, then write out that
> buffer. Then we zero out the buffer contents. Call smgrextend().
>
> Most of that work does not actually need to happen while holding the relation
> extension lock. As far as I can tell, the minimum that needs to be covered by
> the extension lock is the following:
>
> 1) call smgrnblocks()
> 2) insert buffer[s] into the buffer mapping table at the location returned by
>    smgrnblocks
> 3) mark buffer[s] as IO_IN_PROGRESS

Makes sense. I will try to understand and review each patch separately.

Firstly, 0001 avoids an extra loop over waiters and looks like a reasonable change. Some comments on the patch:

1)
+ int lwWaiting; /* 0 if not waiting, 1 if on waitlist, 2 if
+                 * waiting to be woken */
Use macros instead of hard-coded values for better readability?
#define PROC_LW_LOCK_NOT_WAITING 0
#define PROC_LW_LOCK_ON_WAITLIST 1
#define PROC_LW_LOCK_WAITING_TO_BE_WOKEN 2

2) Missing initialization of lwWaiting to 0 or the macro in twophase.c and proc.c:
proc->lwWaiting = false;
MyProc->lwWaiting = false;

3)
+ proclist_delete(&lock->waiters, MyProc->pgprocno, lwWaitLink);
+ found = true;
I guess 'found' is a bit meaningless here as we are doing away with the proclist_foreach_modify loop. We can directly use MyProc->lwWaiting == 1 and remove 'found'.

4)
if (!MyProc->lwWaiting)
if (!proc->lwWaiting)
Can we modify the above conditions in lwlock.c to MyProc->lwWaiting != 1 or PROC_LW_LOCK_ON_WAITLIST or the macro?

5) Is there any specific test case where I can see the benefit of this patch? If yes, can you please share it here?

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
Hi,

On 2022-10-29 18:33:53 +0530, Bharath Rupireddy wrote:
> On Sat, Oct 29, 2022 at 8:24 AM Andres Freund <andres@anarazel.de> wrote:
> >
> > Hi,
> >
> > I'm working to extract independently useful bits from my AIO work, to reduce
> > the size of that patchset. This is one of those pieces.
>
> Thanks a lot for this great work. There are 12 patches in this thread,
> I believe each of these patches is trying to solve separate problems
> and can be reviewed and get committed separately, am I correct?

Mostly, yes.

For 0001 I already started https://www.postgresql.org/message-id/20221027165914.2hofzp4cvutj6gin%40awork3.anarazel.de to discuss the specific issue.

We don't strictly need v1-0002-aio-Add-some-error-checking-around-pinning.patch, but I did find it useful.

v1-0012-bufmgr-debug-Add-PrintBuffer-Desc.patch is not used in the patch series, but I found it quite useful when debugging issues with the patch. A heck of a lot easier to interpret page flags when they can be printed.

I also think there are some architectural questions that'll influence the number of patches. E.g. I'm not convinced v1-0010-heapam-Add-num_pages-to-RelationGetBufferForTuple.patch is quite the right spot to track which additional pages should be used. It could very well instead be alongside ->smgr_targblock. Possibly the best path would instead be to return the additional pages explicitly to callers of RelationGetBufferForTuple(), but RelationGetBufferForTuple() does a bunch of work around pinning that potentially would need to be repeated in heap_multi_insert().

> > In workloads that extend relations a lot, we end up being extremely contended
> > on the relation extension lock. We've attempted to address that to some degree
> > by using batching, which helps, but only so much.
>
> Yes, I too have observed this in the past for parallel inserts in CTAS
> work - https://www.postgresql.org/message-id/CALj2ACW9BUoFqWkmTSeHjFD-W7_00s3orqSvtvUk%2BKD2H7ZmRg%40mail.gmail.com.
> Tackling bulk relation extension problems will unblock the parallel
> inserts (in CTAS, COPY) work I believe.

Yea. There are a lot of places where the current approach ended up being a bottleneck.

> Firstly, 0001 avoids extra loop over waiters and looks a reasonable
> change, some comments on the patch:
> 1)
> + int lwWaiting; /* 0 if not waiting, 1 if on waitlist, 2 if
> +                 * waiting to be woken */
> Use macros instead of hard-coded values for better readability?
>
> #define PROC_LW_LOCK_NOT_WAITING 0
> #define PROC_LW_LOCK_ON_WAITLIST 1
> #define PROC_LW_LOCK_WAITING_TO_BE_WOKEN 2

Yea - this was really more of a prototype patch. I noted that we'd want to use defines for this in https://www.postgresql.org/message-id/20221027165914.2hofzp4cvutj6gin%40awork3.anarazel.de

> 3)
> + proclist_delete(&lock->waiters, MyProc->pgprocno, lwWaitLink);
> + found = true;
> I guess 'found' is a bit meaningless here as we are doing away with
> the proclist_foreach_modify loop. We can directly use
> MyProc->lwWaiting == 1 and remove 'found'.

We can rename it, but I think we still do need it; it's easier to analyze the logic if the relevant check happens on a value from while we held the wait list lock. Probably should do the reset inside the locked section as well.

> 4)
> if (!MyProc->lwWaiting)
> if (!proc->lwWaiting)
> Can we modify the above conditions in lwlock.c to MyProc->lwWaiting !=
> 1 or PROC_LW_LOCK_ON_WAITLIST or the macro?

I think it's better to check that it's 0, rather than just != 1.

> 5) Is there any specific test case that I can see benefit of this
> patch? If yes, can you please share it here?

Yep, see the other thread, there's a pretty easy case there. You can also see it at extreme client counts with a pgbench -S against a cluster with a smaller shared_buffers. But the difference is not huge before something like 2048-4096 clients, and then it only occurs occasionally (because you need to end up with most connections waiting for one of the partitions).
So the test case from the other thread is a lot better.

Greetings,

Andres Freund
On Sat, 29 Oct 2022 at 08:24, Andres Freund <andres@anarazel.de> wrote:
>
> The patches here aren't fully polished (as will be evident). But they should
> be more than good enough to discuss whether this is a sane direction.

The patch does not apply on top of HEAD as in [1], please post a rebased patch:

=== Applying patches on top of PostgreSQL commit ID f2857af485a00ab5dbfa2c83af9d83afe4378239 ===
=== applying patch ./v1-0001-wip-lwlock-fix-quadratic-behaviour-with-very-long.patch
patching file src/include/storage/proc.h
Hunk #1 FAILED at 217.
1 out of 1 hunk FAILED -- saving rejects to file src/include/storage/proc.h.rej
patching file src/backend/storage/lmgr/lwlock.c
Hunk #1 succeeded at 988 with fuzz 2 (offset 1 line).
Hunk #2 FAILED at 1047.
Hunk #3 FAILED at 1076.
Hunk #4 FAILED at 1104.
Hunk #5 FAILED at 1117.
Hunk #6 FAILED at 1141.
Hunk #7 FAILED at 1775.
Hunk #8 FAILED at 1790.
7 out of 8 hunks FAILED -- saving rejects to file src/backend/storage/lmgr/lwlock.c.rej

[1] - http://cfbot.cputube.org/patch_41_3993.log

Regards,
Vignesh
Hi,

On 2023-01-06 11:52:04 +0530, vignesh C wrote:
> On Sat, 29 Oct 2022 at 08:24, Andres Freund <andres@anarazel.de> wrote:
> >
> > The patches here aren't fully polished (as will be evident). But they should
> > be more than good enough to discuss whether this is a sane direction.
>
> The patch does not apply on top of HEAD as in [1], please post a rebased
> patch.

Thanks for letting me know. Updated version attached.

Greetings,

Andres Freund
Attachment
- v2-0001-aio-Add-some-error-checking-around-pinning.patch
- v2-0002-hio-Release-extension-lock-before-initializing-pa.patch
- v2-0003-Add-smgrzeroextend-FileZero-FileFallocate.patch
- v2-0004-bufmgr-Add-Pin-UnpinLocalBuffer.patch
- v2-0005-bufmgr-Acquire-and-clean-victim-buffer-separately.patch
- v2-0006-bufmgr-Support-multiple-in-progress-IOs-by-using-.patch
- v2-0007-bufmgr-Move-relation-extension-handling-into-Bulk.patch
- v2-0008-Convert-a-few-places-to-ExtendRelationBuffered.patch
- v2-0009-heapam-Add-num_pages-to-RelationGetBufferForTuple.patch
- v2-0010-hio-Use-BulkExtendRelationBuffered.patch
- v2-0011-bufmgr-debug-Add-PrintBuffer-Desc.patch
On Tue, 10 Jan 2023 at 15:08, Andres Freund <andres@anarazel.de> wrote:
> Thanks for letting me know. Updated version attached.

I'm not too sure I've qualified for giving a meaningful design review here, but I have started looking at the patches and so far only made it as far as 0006. I noted down the following while reading:

v2-0001:
1. BufferCheckOneLocalPin needs a header comment.

v2-0002:
2. The following comment, and the corresponding code to release the extension lock, has been moved now:

/*
 * Release the file-extension lock; it's now OK for someone else to extend
 * the relation some more.
 */

I think it's worth detailing why it's fine to release the extension lock in the new location. You've added detail to the commit message, but I think you need to do the same in the comments too.

v2-0003:
3. FileFallocate() and FileZero() should likely document what they return, i.e. zero on success and non-zero on failure.
4. I'm not quite clear on why you've modified FileGetRawDesc() to call FileAccess() twice.

v2-0004:
5. Is it worth having two versions of PinLocalBuffer(), one that adjusts the usage count and one that does not? Couldn't the version that does not adjust the count skip doing pg_atomic_read_u32()?

v2-0005
v2-0006

David
I'll continue reviewing this, but here's some feedback on the first two patches:

v2-0001-aio-Add-some-error-checking-around-pinning.patch:

I wonder if the extra assertion in LockBufHdr() is worth the overhead. It won't add anything without assertions, of course, but still. No objections if you think it's worth it.

v2-0002-hio-Release-extension-lock-before-initializing-pa.patch:

Looks good as far as it goes. It's a bit silly that we use RBM_ZERO_AND_LOCK, which zeroes the page, and then we call PageInit to zero the page again. RBM_ZERO_AND_LOCK only zeroes the page if it wasn't in the buffer cache previously, but with P_NEW, that is always true.

- Heikki
> v2-0005-bufmgr-Acquire-and-clean-victim-buffer-separately.patch

This can be applied separately from the rest of the patches, which is nice. Some small comments on it:

* Needs a rebase, it conflicted slightly with commit f30d62c2fc.

* GetVictimBuffer needs a comment to explain what it does. In particular, mention that it returns a buffer that's pinned and known !BM_TAG_VALID.

* I suggest renaming 'cur_buf' and other such local variables in GetVictimBuffer to just 'buf'. The 'cur' prefix suggests that there is some other buffer involved too, but there is no 'prev' or 'next' or 'other' buffer. The old code called it just 'buf' too, and before this patch it actually was a bit confusing because there were two buffers involved. But with this patch, GetVictimBuffer only deals with one buffer at a time.

* This FIXME:

> /* OK, do the I/O */
> /* FIXME: These used the wrong smgr before afaict? */
> {
>     SMgrRelation smgr = smgropen(BufTagGetRelFileLocator(&buf_hdr->tag),
>                                  InvalidBackendId);
>
>     TRACE_POSTGRESQL_BUFFER_WRITE_DIRTY_START(buf_hdr->tag.forkNum,
>                                               buf_hdr->tag.blockNum,
>                                               smgr->smgr_rlocator.locator.spcOid,
>                                               smgr->smgr_rlocator.locator.dbOid,
>                                               smgr->smgr_rlocator.locator.relNumber);
>
>     FlushBuffer(buf_hdr, smgr, IOOBJECT_RELATION, io_context);
>     LWLockRelease(content_lock);
>
>     ScheduleBufferTagForWriteback(&BackendWritebackContext,
>                                   &buf_hdr->tag);
>
>     TRACE_POSTGRESQL_BUFFER_WRITE_DIRTY_DONE(buf_hdr->tag.forkNum,
>                                              buf_hdr->tag.blockNum,
>                                              smgr->smgr_rlocator.locator.spcOid,
>                                              smgr->smgr_rlocator.locator.dbOid,
>                                              smgr->smgr_rlocator.locator.relNumber);
> }

I believe that was intentional. The probes previously reported the block and relation whose read *caused* the eviction. It was not just the smgr but also the blockNum and forkNum that referred to the block that was being read. There's another pair of probe points, TRACE_POSTGRESQL_BUFFER_FLUSH_START/DONE, inside FlushBuffer that indicate the page that is being flushed.
I see that reporting the evicted page is more convenient with this patch; otherwise you'd need to pass the smgr and blocknum of the page that's being read to InvalidateVictimBuffer(). IMHO you can just remove these probe points. We don't need to bend over backwards to maintain specific probe points.

* InvalidateVictimBuffer() reads the buffer header with an atomic read op, just to check if BM_TAG_VALID is set. If it's not, it does nothing (except for a few Asserts). But the caller has already read the buffer header. Consider refactoring it so that the caller checks BM_TAG_VALID, and only calls InvalidateVictimBuffer() if it's set, saving one atomic read in InvalidateVictimBuffer(). I think it would be just as readable, so no loss there. I doubt the atomic read makes any measurable performance difference, but it looks redundant.

* I don't understand this comment:

> /*
>  * Clear out the buffer's tag and flags and usagecount.  We must do
>  * this to ensure that linear scans of the buffer array don't think
>  * the buffer is valid.
>  *
>  * XXX: This is a pre-existing comment I just moved, but isn't it
>  * entirely bogus with regard to the tag? We can't do anything with
>  * the buffer without taking BM_VALID / BM_TAG_VALID into
>  * account. Likely doesn't matter because we're already dirtying the
>  * cacheline, but still.
>  */
> ClearBufferTag(&buf_hdr->tag);
> buf_state &= ~(BUF_FLAG_MASK | BUF_USAGECOUNT_MASK);
> UnlockBufHdr(buf_hdr, buf_state);

What exactly is wrong with clearing the tag? What does dirtying the cacheline have to do with the correctness here?

* pgindent

- Heikki
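The refactoring suggested above for InvalidateVictimBuffer() could be sketched like this (toy types and names, not the actual bufmgr code): the caller has already read the buffer state, so it can test the flag itself and call the invalidation routine only when the flag is set, saving that routine a redundant atomic read.

```c
/* Toy sketch of hoisting the BM_TAG_VALID check into the caller. */
#include <stdatomic.h>
#include <stdint.h>

#define BM_TAG_VALID (1u << 0)

typedef struct BufDesc
{
	_Atomic uint32_t state;
} BufDesc;

static int	ninvalidations = 0;

static void
invalidate_victim(BufDesc *buf)
{
	/* caller guarantees BM_TAG_VALID was set; no re-read needed here */
	atomic_fetch_and(&buf->state, ~BM_TAG_VALID);
	ninvalidations++;
}

void
prepare_victim(BufDesc *buf)
{
	uint32_t	state = atomic_load(&buf->state);	/* the one read */

	if (state & BM_TAG_VALID)
		invalidate_victim(buf);
}
```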
Hi,

On 2023-02-11 23:03:56 +0200, Heikki Linnakangas wrote:
> > v2-0005-bufmgr-Acquire-and-clean-victim-buffer-separately.patch
> This can be applied separately from the rest of the patches, which is nice.
> Some small comments on it:

Thanks for looking at these!

> * Needs a rebase, it conflicted slightly with commit f30d62c2fc.

Will work on that.

> * GetVictimBuffer needs a comment to explain what it does. In particular,
> mention that it returns a buffer that's pinned and known !BM_TAG_VALID.

Will add.

> * I suggest renaming 'cur_buf' and other such local variables in
> GetVictimBuffer to just 'buf'. 'cur' prefix suggests that there is some
> other buffer involved too, but there is no 'prev' or 'next' or 'other'
> buffer. The old code called it just 'buf' too, and before this patch it
> actually was a bit confusing because there were two buffers involved. But
> with this patch, GetVictimBuffer only deals with one buffer at a time.

Hm. Yea. I probably ended up with these names because initially
GetVictimBuffer() wasn't a separate function, and I indeed constantly got
confused by which buffer was referenced.

> * This FIXME:
>
> >     /* OK, do the I/O */
> >     /* FIXME: These used the wrong smgr before afaict? */
> >     {
> >         SMgrRelation smgr = smgropen(BufTagGetRelFileLocator(&buf_hdr->tag),
> >                                      InvalidBackendId);
> >
> >         TRACE_POSTGRESQL_BUFFER_WRITE_DIRTY_START(buf_hdr->tag.forkNum,
> >                                                   buf_hdr->tag.blockNum,
> >                                                   smgr->smgr_rlocator.locator.spcOid,
> >                                                   smgr->smgr_rlocator.locator.dbOid,
> >                                                   smgr->smgr_rlocator.locator.relNumber);
> >
> >         FlushBuffer(buf_hdr, smgr, IOOBJECT_RELATION, io_context);
> >         LWLockRelease(content_lock);
> >
> >         ScheduleBufferTagForWriteback(&BackendWritebackContext,
> >                                       &buf_hdr->tag);
> >
> >         TRACE_POSTGRESQL_BUFFER_WRITE_DIRTY_DONE(buf_hdr->tag.forkNum,
> >                                                  buf_hdr->tag.blockNum,
> >                                                  smgr->smgr_rlocator.locator.spcOid,
> >                                                  smgr->smgr_rlocator.locator.dbOid,
> >                                                  smgr->smgr_rlocator.locator.relNumber);
> >     }
>
> I believe that was intentional.
> The probes previously reported the block and
> relation whose read *caused* the eviction. It was not just the smgr but also
> the blockNum and forkNum that referred to the block that was being read.

You're probably right. It's certainly not understandable from our docs
though:

     <row>
      <entry><literal>buffer-write-dirty-start</literal></entry>
      <entry><literal>(ForkNumber, BlockNumber, Oid, Oid, Oid)</literal></entry>
      <entry>Probe that fires when a server process begins to write a dirty
       buffer.  (If this happens often, it implies that
       <xref linkend="guc-shared-buffers"/> is too small or the background
       writer control parameters need adjustment.) arg0 and arg1 contain the
       fork and block numbers of the page. arg2, arg3, and arg4 contain the
       tablespace, database, and relation OIDs identifying the
       relation.</entry>
     </row>

> I see that reporting the evicted page is more convenient with this patch,
> otherwise you'd need to pass the smgr and blocknum of the page that's being
> read to InvalidateVictimBuffer(). IMHO you can just remove these probe
> points. We don't need to bend over backwards to maintain specific probe
> points.

Agreed.

> * InvalidateVictimBuffer reads the buffer header with an atomic read op,
> just to check if BM_TAG_VALID is set.

It's not a real atomic op, in the sense of being a special instruction. It
does force the compiler to actually read from memory, but that's it. But
you're right, even that is unnecessary.

I think it ended up that way because I also wanted the full buf_hdr, and it
seemed somewhat error prone to pass in both.

> * I don't understand this comment:
>
> >     /*
> >      * Clear out the buffer's tag and flags and usagecount.  We must do
> >      * this to ensure that linear scans of the buffer array don't think
> >      * the buffer is valid.
> >      *
> >      * XXX: This is a pre-existing comment I just moved, but isn't it
> >      * entirely bogus with regard to the tag?
> >      * We can't do anything with
> >      * the buffer without taking BM_VALID / BM_TAG_VALID into
> >      * account. Likely doesn't matter because we're already dirtying the
> >      * cacheline, but still.
> >      */
> >     ClearBufferTag(&buf_hdr->tag);
> >     buf_state &= ~(BUF_FLAG_MASK | BUF_USAGECOUNT_MASK);
> >     UnlockBufHdr(buf_hdr, buf_state);
>
> What exactly is wrong with clearing the tag? What does dirtying the
> cacheline have to do with the correctness here?

There's nothing wrong with clearing out the tag, but I don't think it's a
hard requirement today, and certainly not for the reason stated above.

Validity of the buffer isn't determined by the tag, it's determined by
BM_VALID (or, if you interpret valid more widely, BM_TAG_VALID).

Without either having pinned the buffer, or holding the buffer header
spinlock, the tag can change at any time. And code like DropDatabaseBuffers()
knows that, and re-checks the tag after locking the buffer header spinlock.

Afaict, there'd be no correctness issue with removing the ClearBufferTag().
There would be an efficiency issue though, because when encountering an
invalid buffer, we'd unnecessarily enter InvalidateBuffer(), which'd find
that BM_[TAG_]VALID isn't set, and not do anything.

Even though it's not a correctness issue, it seems to me that
DropRelationsAllBuffers() etc ought to check if the buffer is BM_TAG_VALID,
before doing anything further. Particularly in DropRelationsAllBuffers(), the
check we do for each buffer isn't cheap. Doing it for buffers that don't even
have a tag seems .. not smart.

Greetings,

Andres Freund
On 2023-02-11 13:36:51 -0800, Andres Freund wrote:
> Even though it's not a correctness issue, it seems to me that
> DropRelationsAllBuffers() etc ought to check if the buffer is BM_TAG_VALID,
> before doing anything further. Particularly in DropRelationsAllBuffers(), the
> check we do for each buffer isn't cheap. Doing it for buffers that don't even
> have a tag seems .. not smart.

There's a small regression for a single relation, but after that it's a
clear benefit. 32GB shared buffers, empty. The test creates N new relations
and then rolls back.

num relations   HEAD (tps)   precheck (tps)
            1        46.11            45.22
            2        43.24            44.87
            4        35.14            44.20
            8        28.72            42.79

I don't understand the regression at 1, TBH. I think it must be a random
code layout issue, because the same pre-check in DropRelationBuffers()
(exercised via TRUNCATE of a newly created relation), shows a tiny speedup.
Hi,

On 2023-02-10 18:38:50 +0200, Heikki Linnakangas wrote:
> I'll continue reviewing this, but here's some feedback on the first two
> patches:
>
> v2-0001-aio-Add-some-error-checking-around-pinning.patch:
>
> I wonder if the extra assertion in LockBufHdr() is worth the overhead. It
> won't add anything without assertions, of course, but still. No objections
> if you think it's worth it.

It's so easy to get confused about local/non-local buffers, that I think it
is useful.

I think we really need to consider cleaning up the separation further.
Having half the code for local buffers in bufmgr.c and the other half in
localbuf.c, without a split that I can recognize, is not a good scheme.

It bothers me somewhat that ConditionalLockBufferForCleanup() silently
accepts multiple pins by the current backend. That's the right thing for
e.g. heap_page_prune_opt(), but for something like lazy_scan_heap() it's
not. And yes, I did encounter a bug hidden by that when making vacuumlazy
use AIO as part of that patchset. That's why I made BufferCheckOneLocalPin()
externally visible.

> v2-0002-hio-Release-extension-lock-before-initializing-pa.patch:
>
> Looks as far as it goes. It's a bit silly that we use RBM_ZERO_AND_LOCK,
> which zeroes the page, and then we call PageInit to zero the page again.
> RBM_ZERO_AND_LOCK only zeroes the page if it wasn't in the buffer cache
> previously, but with P_NEW, that is always true.

It is quite silly, and it shows up noticeably in profiles. The zeroing is
definitely needed in other places calling PageInit(), though. I suspect we
should have a PageInitZeroed() or such, that asserts the page is zero, but
otherwise skips it.

Seems independent enough from this series, that I'd probably tackle it
separately? If you prefer, I'm ok with adding a patch to this series
instead, though.

Greetings,

Andres Freund
Hi,

On 2023-01-20 13:40:55 +1300, David Rowley wrote:
> On Tue, 10 Jan 2023 at 15:08, Andres Freund <andres@anarazel.de> wrote:
> > Thanks for letting me know. Updated version attached.
>
> I'm not too sure I've qualified for giving a meaningful design review
> here, but I have started looking at the patches and so far only made
> it as far as 0006.

Thanks!

> I noted down the following while reading:
>
> v2-0001:
>
> 1. BufferCheckOneLocalPin needs a header comment
>
> v2-0002:
>
> 2. The following comment and corresponding code to release the
> extension lock has been moved now.
>
> /*
>  * Release the file-extension lock; it's now OK for someone else to extend
>  * the relation some more.
>  */
>
> I think it's worth detailing out why it's fine to release the
> extension lock in the new location. You've added detail to the commit
> message but I think you need to do the same in the comments too.

Will do.

> v2-0003
>
> 3. FileFallocate() and FileZero() should likely document what they
> return, i.e. zero on success and non-zero on failure.

I guess I just tried to fit in with the rest of the file :)

> 4. I'm not quite clear on why you've modified FileGetRawDesc() to call
> FileAccess() twice.

I do not have the faintest idea what happened there... Will fix.

> v2-0004:
>
> 5. Is it worth having two versions of PinLocalBuffer() one to adjust
> the usage count and one that does not? Couldn't the version that does
> not adjust the count skip doing pg_atomic_read_u32()?

I think it'd be nicer to just move the read inside the if
(adjust_usagecount). That way the rest of the function doesn't have to be
duplicated.

Thanks,

Andres Freund
Hi,

On 2023-02-11 14:25:06 -0800, Andres Freund wrote:
> On 2023-01-20 13:40:55 +1300, David Rowley wrote:
> > v2-0004:
> >
> > 5. Is it worth having two versions of PinLocalBuffer() one to adjust
> > the usage count and one that does not? Couldn't the version that does
> > not adjust the count skip doing pg_atomic_read_u32()?
>
> I think it'd be nicer to just move the read inside the if
> (adjust_usagecount). That way the rest of the function doesn't have to be
> duplicated.

Ah, no, we need it for the return value. No current users of
PinLocalBuffer(adjust_usagecount = false) need the return value, but I
don't think that will necessarily stay the case.

I'm somewhat inclined to not duplicate it, but if you think it's worth it,
I'll do that.

Greetings,

Andres Freund
On 11/02/2023 23:36, Andres Freund wrote:
> On 2023-02-11 23:03:56 +0200, Heikki Linnakangas wrote:
>> * I don't understand this comment:
>>
>>>     /*
>>>      * Clear out the buffer's tag and flags and usagecount.  We must do
>>>      * this to ensure that linear scans of the buffer array don't think
>>>      * the buffer is valid.
>>>      *
>>>      * XXX: This is a pre-existing comment I just moved, but isn't it
>>>      * entirely bogus with regard to the tag? We can't do anything with
>>>      * the buffer without taking BM_VALID / BM_TAG_VALID into
>>>      * account. Likely doesn't matter because we're already dirtying the
>>>      * cacheline, but still.
>>>      */
>>>     ClearBufferTag(&buf_hdr->tag);
>>>     buf_state &= ~(BUF_FLAG_MASK | BUF_USAGECOUNT_MASK);
>>>     UnlockBufHdr(buf_hdr, buf_state);
>>
>> What exactly is wrong with clearing the tag? What does dirtying the
>> cacheline have to do with the correctness here?
>
> There's nothing wrong with clearing out the tag, but I don't think it's a
> hard requirement today, and certainly not for the reason stated above.
>
> Validity of the buffer isn't determined by the tag, it's determined by
> BM_VALID (or, if you interpret valid more widely, BM_TAG_VALID).
>
> Without either having pinned the buffer, or holding the buffer header
> spinlock, the tag can change at any time. And code like DropDatabaseBuffers()
> knows that, and re-checks the tag after locking the buffer header
> spinlock.
>
> Afaict, there'd be no correctness issue with removing the
> ClearBufferTag(). There would be an efficiency issue though, because when
> encountering an invalid buffer, we'd unnecessarily enter InvalidateBuffer(),
> which'd find that BM_[TAG_]VALID isn't set, and not do anything.

Okay, gotcha.

> Even though it's not a correctness issue, it seems to me that
> DropRelationsAllBuffers() etc ought to check if the buffer is BM_TAG_VALID,
> before doing anything further. Particularly in DropRelationsAllBuffers(), the
> check we do for each buffer isn't cheap.
> Doing it for buffers that don't even
> have a tag seems .. not smart.

Depends on what percentage of buffers are valid, I guess. If all buffers
are valid, checking BM_TAG_VALID first would lose. In practice, I doubt it
makes any measurable difference either way.

Since we're micro-optimizing, I noticed that BufTagMatchesRelFileLocator()
compares the fields in order "spcOid, dbOid, relNumber". Before commit
82ac34db20, we used RelFileLocatorEqual(), which has this comment:

/*
 * Note: RelFileLocatorEquals and RelFileLocatorBackendEquals compare relNumber
 * first since that is most likely to be different in two unequal
 * RelFileLocators.  It is probably redundant to compare spcOid if the other
 * fields are found equal, but do it anyway to be sure.  Likewise for checking
 * the backend ID in RelFileLocatorBackendEquals.
 */

So we lost that micro-optimization. Should we reorder the checks in
BufTagMatchesRelFileLocator()?

- Heikki
> v2-0006-bufmgr-Support-multiple-in-progress-IOs-by-using-.patch

This looks straightforward. My only concern is that it changes the order
that things happen at abort. Currently, AbortBufferIO() is called very
early in AbortTransaction(), and this patch moves it much later. I don't
see any immediate problems from that, but it feels scary.

> @@ -2689,7 +2685,6 @@ InitBufferPoolAccess(void)
>  static void
>  AtProcExit_Buffers(int code, Datum arg)
>  {
> -    AbortBufferIO();
>      UnlockBuffers();
>
>      CheckForBufferLeaks();

Hmm, do we call AbortTransaction() and ResourceOwnerRelease() on
elog(FATAL)? Do we need to worry about that?

- Heikki
> v2-0007-bufmgr-Move-relation-extension-handling-into-Bulk.patch

> +static BlockNumber
> +BulkExtendSharedRelationBuffered(Relation rel,
> +                                 SMgrRelation smgr,
> +                                 bool skip_extension_lock,
> +                                 char relpersistence,
> +                                 ForkNumber fork, ReadBufferMode mode,
> +                                 BufferAccessStrategy strategy,
> +                                 uint32 *num_pages,
> +                                 uint32 num_locked_pages,
> +                                 Buffer *buffers)

Ugh, that's a lot of arguments, some are inputs and some are outputs. I
don't have any concrete suggestions, but could we simplify this somehow?
Needs a comment at least.

> v2-0008-Convert-a-few-places-to-ExtendRelationBuffered.patch

> diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
> index de1427a1e0e..1810f7ebfef 100644
> --- a/src/backend/access/brin/brin.c
> +++ b/src/backend/access/brin/brin.c
> @@ -829,9 +829,11 @@ brinbuild(Relation heap, Relation index, IndexInfo *indexInfo)
>       * whole relation will be rolled back.
>       */
>
> -    meta = ReadBuffer(index, P_NEW);
> +    meta = ExtendRelationBuffered(index, NULL, true,
> +                                  index->rd_rel->relpersistence,
> +                                  MAIN_FORKNUM, RBM_ZERO_AND_LOCK,
> +                                  NULL);
>      Assert(BufferGetBlockNumber(meta) == BRIN_METAPAGE_BLKNO);
> -    LockBuffer(meta, BUFFER_LOCK_EXCLUSIVE);
>
>      brin_metapage_init(BufferGetPage(meta), BrinGetPagesPerRange(index),
>                         BRIN_CURRENT_VERSION);

Since we're changing the API anyway, how about introducing a new function
for this case where we extend the relation but we know what block number
we're going to get? This pattern of using P_NEW and asserting the result
has always felt awkward to me.

> -    buf = ReadBuffer(irel, P_NEW);
> +    buf = ExtendRelationBuffered(irel, NULL, false,
> +                                 irel->rd_rel->relpersistence,
> +                                 MAIN_FORKNUM, RBM_ZERO_AND_LOCK,
> +                                 NULL);

These new calls are pretty verbose, compared to ReadBuffer(rel, P_NEW). I'd
suggest something like:

buf = ExtendBuffer(rel);

Do other ReadBufferModes than RBM_ZERO_AND_LOCK make sense with
ExtendRelationBuffered?

Is it ever possible to call this without a relcache entry?
WAL redo functions do that with ReadBuffer, but they only extend a relation
implicitly, by replaying a record for a particular block.

All of the above comments are around the BulkExtendRelationBuffered()
function's API. That needs a closer look and a more thought-out design to
make it nice. Aside from that, this approach seems valid.

- Heikki
On 2023-Feb-21, Heikki Linnakangas wrote:

> > +static BlockNumber
> > +BulkExtendSharedRelationBuffered(Relation rel,
> > +                                 SMgrRelation smgr,
> > +                                 bool skip_extension_lock,
> > +                                 char relpersistence,
> > +                                 ForkNumber fork, ReadBufferMode mode,
> > +                                 BufferAccessStrategy strategy,
> > +                                 uint32 *num_pages,
> > +                                 uint32 num_locked_pages,
> > +                                 Buffer *buffers)
>
> Ugh, that's a lot of arguments, some are inputs and some are outputs. I
> don't have any concrete suggestions, but could we simplify this somehow?
> Needs a comment at least.

Yeah, I noticed this too. I think it would be easy enough to add a new
struct that can be passed as a pointer, which can be stack-allocated by the
caller, and which holds the input arguments that are common to both
functions, as is sensible.

--
Álvaro Herrera         48°01'N 7°57'E  —  https://www.EnterpriseDB.com/
"Update: super-fast reaction on the Postgres bugs mailing list. The report
was acknowledged [...], and a fix is under discussion. The wonders of
open-source !" https://twitter.com/gunnarmorling/status/1596080409259003906
Hi,

On 2023-02-21 17:40:31 +0200, Heikki Linnakangas wrote:
> > v2-0006-bufmgr-Support-multiple-in-progress-IOs-by-using-.patch
>
> This looks straightforward. My only concern is that it changes the order
> that things happen at abort. Currently, AbortBufferIO() is called very early
> in AbortTransaction(), and this patch moves it much later. I don't see any
> immediate problems from that, but it feels scary.

Yea, it does feel a bit awkward. But I suspect it's actually the right
thing. We've not even adjusted the transaction state at the point we're
calling AbortBufferIO(). And AbortBufferIO() will sometimes allocate memory
for a WARNING, which conceivably could fail - although I don't think that's
a particularly realistic scenario due to TransactionAbortContext (I guess
you could have a large error context stack or such).

Medium term I think we need to move a lot more of the error handling into
resowners. Having a dozen+ places with their own choreographed sigsetjmp()
recovery blocks is error prone as hell. Not to mention tedious.

> > @@ -2689,7 +2685,6 @@ InitBufferPoolAccess(void)
> >  static void
> >  AtProcExit_Buffers(int code, Datum arg)
> >  {
> > -    AbortBufferIO();
> >      UnlockBuffers();
> >
> >      CheckForBufferLeaks();
>
> Hmm, do we call AbortTransaction() and ResourceOwnerRelease() on
> elog(FATAL)? Do we need to worry about that?

We have before_shmem_exit() callbacks that should protect against that.
InitPostgres() registers ShutdownPostgres(), and
CreateAuxProcessResourceOwner() registers
ReleaseAuxProcessResourcesCallback(). I think we'd already be in trouble if
we didn't reliably end up doing resowner cleanup during process exit.

Perhaps ResourceOwnerCreate()/ResourceOwnerDelete() should maintain a list
of "active" resource owners and have a before-exit callback that ensures
the list is empty and PANICs if not? Better a crash restart than hanging
because we didn't release some shared resource.

Greetings,

Andres Freund
Hi,

On 2023-02-21 18:18:02 +0200, Heikki Linnakangas wrote:
> > v2-0007-bufmgr-Move-relation-extension-handling-into-Bulk.patch
>
> > +static BlockNumber
> > +BulkExtendSharedRelationBuffered(Relation rel,
> > +                                 SMgrRelation smgr,
> > +                                 bool skip_extension_lock,
> > +                                 char relpersistence,
> > +                                 ForkNumber fork, ReadBufferMode mode,
> > +                                 BufferAccessStrategy strategy,
> > +                                 uint32 *num_pages,
> > +                                 uint32 num_locked_pages,
> > +                                 Buffer *buffers)
>
> Ugh, that's a lot of arguments, some are inputs and some are outputs. I
> don't have any concrete suggestions, but could we simplify this somehow?
> Needs a comment at least.

Yea. I think this is the part of the patchset I like the least.

The ugliest bit is accepting both rel and smgr. The background to that is
that we need the relation oid to acquire the extension lock. But during
crash recovery we don't have that - which is fine, because we don't need
the extension lock. We could have two different functions, but that ends up
a mess as well, as we've seen in other cases.

> v2-0008-Convert-a-few-places-to-ExtendRelationBuffered.patch
>
> > diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
> > index de1427a1e0e..1810f7ebfef 100644
> > --- a/src/backend/access/brin/brin.c
> > +++ b/src/backend/access/brin/brin.c
> > @@ -829,9 +829,11 @@ brinbuild(Relation heap, Relation index, IndexInfo *indexInfo)
> >       * whole relation will be rolled back.
> >       */
> >
> > -    meta = ReadBuffer(index, P_NEW);
> > +    meta = ExtendRelationBuffered(index, NULL, true,
> > +                                  index->rd_rel->relpersistence,
> > +                                  MAIN_FORKNUM, RBM_ZERO_AND_LOCK,
> > +                                  NULL);
> >      Assert(BufferGetBlockNumber(meta) == BRIN_METAPAGE_BLKNO);
> > -    LockBuffer(meta, BUFFER_LOCK_EXCLUSIVE);
> >
> >      brin_metapage_init(BufferGetPage(meta), BrinGetPagesPerRange(index),
> >                         BRIN_CURRENT_VERSION);
>
> Since we're changing the API anyway, how about introducing a new function
> for this case where we extend the relation but we know what block number
> we're going to get? This pattern of using P_NEW and asserting the result has
> always felt awkward to me.

To me it always felt like a code smell that some code insists on getting
specific block numbers with P_NEW. I guess it's ok for things like building
a new index, but outside of that it feels wrong. The first case I found
just now is revmap_physical_extend(). Which seems to extend the relation
while holding an lwlock. Ugh.

Maybe ExtendRelationBufferedTo() or something like that? With a big comment
saying that users of it are likely bad ;)

> > -    buf = ReadBuffer(irel, P_NEW);
> > +    buf = ExtendRelationBuffered(irel, NULL, false,
> > +                                 irel->rd_rel->relpersistence,
> > +                                 MAIN_FORKNUM, RBM_ZERO_AND_LOCK,
> > +                                 NULL);
>
> These new calls are pretty verbose, compared to ReadBuffer(rel, P_NEW). I'd
> suggest something like:

I guess. Not sure if it's worth optimizing for brevity all that much here -
there's not that many places extending relations. Several places end up
with less code, actually, because they don't need to care about the
extension lock themselves anymore.

I think an ExtendBuffer() that doesn't mention the fork, etc, ends up being
more confusing than helpful.

> buf = ExtendBuffer(rel);

Without the relation in the name it just seems confusing to me - the
extension isn't "restricted" to shared_buffers. ReadBuffer() isn't great as
a name either, but it makes a bit more sense at least, it reads into a
buffer.
And it's a vastly more frequent operation, so optimizing for density is
worth it.

> Do other ReadBufferModes than RBM_ZERO_AND_LOCK make sense with
> ExtendRelationBuffered?

Hm. That's a good point. Probably not. Perhaps it could be useful to
support RBM_NORMAL as well? But even then, it'd just be a lock release away
if we always used RBM_ZERO_AND_LOCK.

> Is it ever possible to call this without a relcache entry? WAL redo
> functions do that with ReadBuffer, but they only extend a relation
> implicitly, by replaying a record for a particular block.

I think we should use it for crash recovery as well, but the patch doesn't
yet. We have some gnarly code there, see the loop using P_NEW in
XLogReadBufferExtended(). Extending the file one-by-one is a lot more
expensive than doing it in bulk.

> All of the above comments are around the BulkExtendRelationBuffered()
> function's API. That needs a closer look and a more thought-out design to
> make it nice. Aside from that, this approach seems valid.

Thanks for looking! I agree that it can stand a fair bit of polishing...

Greetings,

Andres Freund
On 10/28/22 9:54 PM, Andres Freund wrote:
> b) I found that it is quite beneficial to bulk-extend the relation with
> smgrextend() even without concurrency. The reason for that is primarily
> the aforementioned dirty buffers that our current extension method causes.
>
> One bit that stumped me for quite a while is to know how much to extend the
> relation by. RelationGetBufferForTuple() drives the decision whether / how
> much to bulk extend purely on the contention on the extension lock, which
> obviously does not work for non-concurrent workloads.
>
> After quite a while I figured out that we actually have good information on
> how much to extend by, at least for COPY /
> heap_multi_insert(). heap_multi_insert() can compute how much space is
> needed to store all tuples, and pass that on to
> RelationGetBufferForTuple().
>
> For that to be accurate we need to recompute that number whenever we use an
> already partially filled page. That's not great, but doesn't appear to be a
> measurable overhead.

Some food for thought: I think it's also completely fine to extend any
relation over a certain size by multiple blocks, regardless of concurrency.
E.g. 10 extra blocks on an 80MB relation is 0.1%. I don't have a good feel
for what algorithm would make sense here; maybe something along the lines
of

    extend = max(relpages / 2048, 128);
    if extend < 8
        extend = 1;

(presumably extending by just a couple extra pages doesn't help much
without concurrency).
Hi,

On 2023-02-21 15:00:15 -0600, Jim Nasby wrote:
> On 10/28/22 9:54 PM, Andres Freund wrote:
> > b) I found that it is quite beneficial to bulk-extend the relation with
> > smgrextend() even without concurrency. The reason for that is primarily
> > the aforementioned dirty buffers that our current extension method causes.
> >
> > One bit that stumped me for quite a while is to know how much to extend the
> > relation by. RelationGetBufferForTuple() drives the decision whether / how
> > much to bulk extend purely on the contention on the extension lock, which
> > obviously does not work for non-concurrent workloads.
> >
> > After quite a while I figured out that we actually have good information on
> > how much to extend by, at least for COPY /
> > heap_multi_insert(). heap_multi_insert() can compute how much space is
> > needed to store all tuples, and pass that on to
> > RelationGetBufferForTuple().
> >
> > For that to be accurate we need to recompute that number whenever we use an
> > already partially filled page. That's not great, but doesn't appear to be a
> > measurable overhead.
>
> Some food for thought: I think it's also completely fine to extend any
> relation over a certain size by multiple blocks, regardless of concurrency.
> E.g. 10 extra blocks on an 80MB relation is 0.1%. I don't have a good feel
> for what algorithm would make sense here; maybe something along the lines of
> extend = max(relpages / 2048, 128); if extend < 8 extend = 1; (presumably
> extending by just a couple extra pages doesn't help much without
> concurrency).

I previously implemented just that. It's not easy to get right. You can
easily end up with several backends each extending the relation by quite a
bit, at the same time (or you re-introduce contention). Which can end up
with a relation being larger by a bunch if data loading stops at some
point.
We might want that as well at some point, but the approach implemented in
the patchset is precise and thus always a win, so it should be the
baseline.

Greetings,

Andres Freund
On 2/21/23 3:12 PM, Andres Freund wrote:
> Hi,
>
> On 2023-02-21 15:00:15 -0600, Jim Nasby wrote:
>> Some food for thought: I think it's also completely fine to extend any
>> relation over a certain size by multiple blocks, regardless of concurrency.
>> E.g. 10 extra blocks on an 80MB relation is 0.1%. I don't have a good feel
>> for what algorithm would make sense here; maybe something along the lines of
>> extend = max(relpages / 2048, 128); if extend < 8 extend = 1; (presumably
>> extending by just a couple extra pages doesn't help much without
>> concurrency).
>
> I previously implemented just that. It's not easy to get right. You can easily
> end up with several backends each extending the relation by quite a bit, at
> the same time (or you re-introduce contention). Which can end up with a
> relation being larger by a bunch if data loading stops at some point.
>
> We might want that as well at some point, but the approach implemented in the
> patchset is precise and thus always a win, and thus should be the baseline.

Yeah, what I was suggesting would only make sense when there *wasn't*
contention.
On 21/02/2023 21:22, Andres Freund wrote:
> On 2023-02-21 18:18:02 +0200, Heikki Linnakangas wrote:
>> Is it ever possible to call this without a relcache entry? WAL redo
>> functions do that with ReadBuffer, but they only extend a relation
>> implicitly, by replaying a record for a particular block.
>
> I think we should use it for crash recovery as well, but the patch doesn't
> yet. We have some gnarly code there, see the loop using P_NEW in
> XLogReadBufferExtended(). Extending the file one-by-one is a lot more
> expensive than doing it in bulk.

Hmm, XLogReadBufferExtended() could use smgrzeroextend() to fill the gap,
and then call ExtendRelationBuffered for the target page. Or the new
ExtendRelationBufferedTo() function that you mentioned.

In the common case that you load a lot of data into a relation, extending
it, and then crash, the WAL replay would still extend the relation one page
at a time, which is inefficient. Changing that would need bigger changes,
to WAL-log the relation extension as a separate WAL record, for example. I
don't think we need to solve that right now, it can be addressed separately
later.

- Heikki
Hi,

On 2023-02-22 11:18:57 +0200, Heikki Linnakangas wrote:
> On 21/02/2023 21:22, Andres Freund wrote:
> > On 2023-02-21 18:18:02 +0200, Heikki Linnakangas wrote:
> > > Is it ever possible to call this without a relcache entry? WAL redo
> > > functions do that with ReadBuffer, but they only extend a relation
> > > implicitly, by replaying a record for a particular block.
> >
> > I think we should use it for crash recovery as well, but the patch doesn't
> > yet. We have some gnarly code there, see the loop using P_NEW in
> > XLogReadBufferExtended(). Extending the file one-by-one is a lot more
> > expensive than doing it in bulk.
>
> Hmm, XLogReadBufferExtended() could use smgrzeroextend() to fill the gap,
> and then call ExtendRelationBuffered for the target page. Or the new
> ExtendRelationBufferedTo() function that you mentioned.

I don't think it's safe to just use smgrzeroextend(). Without the
page-level interlock from the buffer entry, a concurrent reader can
read/write the extended portion of the relation, while we're extending.
That can lead to losing writes.

It also turns out that just doing smgrzeroextend(), without filling s_b, is
often bad for performance, because it may cause reads when trying to fill
the buffers. Although hopefully that's less of an issue during WAL replay,
due to REGBUF_WILL_INIT.

> In the common case that you load a lot of data to a relation extending it,
> and then crash, the WAL replay would still extend the relation one page at a
> time, which is inefficient. Changing that would need bigger changes, to
> WAL-log the relation extension as a separate WAL record, for example. I
> don't think we need to solve that right now, it can be addressed separately
> later.

Yea, that seems indeed something for later. There's several things we could
do without adding WAL logging of relation extensions themselves. One
relatively easy thing would be to add information about the number of
blocks we're extending by to XLOG_HEAP2_MULTI_INSERT records.
Compared to the insertions themselves that'd barely be noticeable. A slightly more complicated thing would be to peek ahead in the WAL (we have infrastructure for that now) and extend by enough for the next few relation extensions. Greetings, Andres Freund
Hi, On 2023-02-21 11:22:26 -0800, Andres Freund wrote: > On 2023-02-21 18:18:02 +0200, Heikki Linnakangas wrote: > > Do other ReadBufferModes than RBM_ZERO_AND_LOCK make sense with > > ExtendRelationBuffered? > > Hm. That's a good point. Probably not. Perhaps it could be useful to support > RBM_NORMAL as well? But even then, it'd just be a lock release away if we always > used RBM_ZERO_AND_LOCK. There's a fair number of callers using RBM_NORMAL, via ReadBuffer[Extended]() right now. While some of them are trivial to convert, others aren't (e.g., brin_getinsertbuffer()). So I'm inclined to continue allowing RBM_NORMAL. Greetings, Andres Freund
On 21/02/2023 18:33, Alvaro Herrera wrote: > On 2023-Feb-21, Heikki Linnakangas wrote: > >>> +static BlockNumber >>> +BulkExtendSharedRelationBuffered(Relation rel, >>> + SMgrRelation smgr, >>> + bool skip_extension_lock, >>> + char relpersistence, >>> + ForkNumber fork, ReadBufferMode mode, >>> + BufferAccessStrategy strategy, >>> + uint32 *num_pages, >>> + uint32 num_locked_pages, >>> + Buffer *buffers) >> >> Ugh, that's a lot of arguments, some are inputs and some are outputs. I >> don't have any concrete suggestions, but could we simplify this somehow? >> Needs a comment at least. > > Yeah, I noticed this too. I think it would be easy enough to add a new > struct that can be passed as a pointer, which can be stack-allocated > by the caller, and which holds the input arguments that are common to > both functions, as is sensible. We also do this in freespace.c and visibilitymap.c: /* Extend as needed. */ while (fsm_nblocks_now < fsm_nblocks) { PageSetChecksumInplace((Page) pg.data, fsm_nblocks_now); smgrextend(reln, FSM_FORKNUM, fsm_nblocks_now, pg.data, false); fsm_nblocks_now++; } We could use the new smgrzeroextend function here. But it would be better to go through the buffer cache, because after this, the last block, at 'fsm_nblocks', will be read with ReadBuffer() and modified. We could use BulkExtendSharedRelationBuffered() to extend the relation and keep the last page locked, but the BulkExtendSharedRelationBuffered() signature doesn't allow that. It can return the first N pages locked, but there's no way to return the *last* page locked. Perhaps we should decompose this function into several function calls. 
Something like:

/* get N victim buffers, pinned and !BM_VALID */
buffers = BeginExtendRelation(int npages);

LockRelationForExtension(rel)

/* Insert buffers into buffer table */
first_blk = smgrnblocks()
for (blk = first_blk; blk < last_blk; blk++)
    MapNewBuffer(blk, buffers[i])

/* extend the file on disk */
smgrzeroextend();

UnlockRelationForExtension(rel)

for (blk = first_blk; blk < last_blk; blk++)
{
    memset(BufferGetPage(buffers[i]), 0, BLCKSZ);
    FinishNewBuffer(buffers[i])
    /* optionally lock the buffer */
    LockBuffer(buffers[i]);
}

That's a lot more verbose, of course, but gives the callers the flexibility. And might even be more readable than one function call with lots of arguments. This would expose the concept of a buffer that's mapped but marked as IO-in-progress outside bufmgr.c. On one hand, maybe that's exposing details that shouldn't be exposed. On the other hand, it might come in handy. Instead of RBM_ZERO_AND_LOCK mode, for example, it might be handy to have a function that returns an IO-in-progress buffer that you can initialize any way you want. - Heikki
Hi, On 2023-02-27 18:06:22 +0200, Heikki Linnakangas wrote: > We also do this in freespace.c and visibilitymap.c: > > /* Extend as needed. */ > while (fsm_nblocks_now < fsm_nblocks) > { > PageSetChecksumInplace((Page) pg.data, fsm_nblocks_now); > > smgrextend(reln, FSM_FORKNUM, fsm_nblocks_now, > pg.data, false); > fsm_nblocks_now++; > } > > We could use the new smgrzeroextend function here. But it would be better to > go through the buffer cache, because after this, the last block, at > 'fsm_nblocks', will be read with ReadBuffer() and modified. I doubt it's a particularly crucial thing to optimize. But, uh, isn't this code racy? Because this doesn't go through shared buffers, there's no IO_IN_PROGRESS interlocking against a concurrent reader. We know that writing pages isn't atomic vs readers. So another connection could see the new relation size, but a read might return a partially written state of the page. Which then would cause checksum failures. And even worse, I think it could lead to losing a write, if the concurrent connection writes out a page. > We could use BulkExtendSharedRelationBuffered() to extend the relation and > keep the last page locked, but the BulkExtendSharedRelationBuffered() > signature doesn't allow that. It can return the first N pages locked, but > there's no way to return the *last* page locked. We can't rely on bulk extending a, potentially, large number of pages in one go anyway (since we might not be allowed to pin that many pages). So I don't think requiring locking the last page is a really viable API. I think for this case I'd just use the ExtendRelationTo() API we were discussing nearby. Compared to the cost of reducing syscalls / filesystem overhead to extend the relation, the cost of the buffer mapping lookup doesn't seem significant. That's different in e.g. 
the hio.c case, because there we need a buffer with free space, and concurrent activity could otherwise fill up the buffer before we can lock it again. I had started hacking on ExtendRelationTo() when I saw problems with the existing code that made me hesitate: https://postgr.es/m/20230223010147.32oir7sb66slqnjk%40awork3.anarazel.de > Perhaps we should decompose this function into several function calls. > Something like: > > /* get N victim buffers, pinned and !BM_VALID */ > buffers = BeginExtendRelation(int npages); > > LockRelationForExtension(rel) > > /* Insert buffers into buffer table */ > first_blk = smgrnblocks() > for (blk = first_blk; blk < last_blk; blk++) > MapNewBuffer(blk, buffers[i]) > > /* extend the file on disk */ > smgrzeroextend(); > > UnlockRelationForExtension(rel) > > for (blk = first_blk; blk < last_blk; blk++) > { > memset(BufferGetPage(buffers[i]), 0, BLCKSZ); > FinishNewBuffer(buffers[i]) > /* optionally lock the buffer */ > LockBuffer(buffers[i]); > } > > That's a lot more verbose, of course, but gives the callers the flexibility. > And might even be more readable than one function call with lots of > arguments. To me this seems like quite a bad idea. The amount of complexity this would expose all over the tree is substantial. Which would also make it harder to further improve relation extension at a later date. It certainly shouldn't be the default interface. And I'm not sure I see a promising use case. > This would expose the concept of a buffer that's mapped but marked as > IO-in-progress outside bufmgr.c. On one hand, maybe that's exposing details > that shouldn't be exposed. On the other hand, it might come handy. Instead > of RBM_ZERO_AND_LOCK mode, for example, it might be handy to have a function > that returns an IO-in-progress buffer that you can initialize any way you > want. I'd much rather encapsulate that in additional functions, or perhaps a callback that can make decisions about what to do. Greetings, Andres Freund
Hi, On 2023-02-21 17:33:31 +0100, Alvaro Herrera wrote: > On 2023-Feb-21, Heikki Linnakangas wrote: > > > > +static BlockNumber > > > +BulkExtendSharedRelationBuffered(Relation rel, > > > + SMgrRelation smgr, > > > + bool skip_extension_lock, > > > + char relpersistence, > > > + ForkNumber fork, ReadBufferMode mode, > > > + BufferAccessStrategy strategy, > > > + uint32 *num_pages, > > > + uint32 num_locked_pages, > > > + Buffer *buffers) > > > > Ugh, that's a lot of arguments, some are inputs and some are outputs. I > > don't have any concrete suggestions, but could we simplify this somehow? > > Needs a comment at least. > > Yeah, I noticed this too. I think it would be easy enough to add a new > struct that can be passed as a pointer, which can be stack-allocated > by the caller, and which holds the input arguments that are common to > both functions, as is sensible. I played a fair bit with various options. I ended up not using a struct to pass most options, but instead went with a flags argument. However, I did use a struct for passing either relation or smgr. typedef enum ExtendBufferedFlags { EB_SKIP_EXTENSION_LOCK = (1 << 0), EB_IN_RECOVERY = (1 << 1), EB_CREATE_FORK_IF_NEEDED = (1 << 2), EB_LOCK_FIRST = (1 << 3), /* internal flags follow */ EB_RELEASE_PINS = (1 << 4), } ExtendBufferedFlags; /* * To identify the relation - either relation or smgr + relpersistence has to * be specified. Used via the EB_REL()/EB_SMGR() macros below. This allows us * to use the same function for both crash recovery and normal operation. 
*/ typedef struct ExtendBufferedWhat { Relation rel; struct SMgrRelationData *smgr; char relpersistence; } ExtendBufferedWhat; #define EB_REL(p_rel) ((ExtendBufferedWhat){.rel = p_rel}) /* requires use of EB_SKIP_EXTENSION_LOCK */ #define EB_SMGR(p_smgr, p_relpersistence) ((ExtendBufferedWhat){.smgr = p_smgr, .relpersistence = p_relpersistence}) extern Buffer ExtendBufferedRel(ExtendBufferedWhat eb, ForkNumber forkNum, BufferAccessStrategy strategy, uint32 flags); extern BlockNumber ExtendBufferedRelBy(ExtendBufferedWhat eb, ForkNumber fork, BufferAccessStrategy strategy, uint32 flags, uint32 extend_by, Buffer *buffers, uint32 *extended_by); extern Buffer ExtendBufferedRelTo(ExtendBufferedWhat eb, ForkNumber fork, BufferAccessStrategy strategy, uint32 flags, BlockNumber extend_to, ReadBufferMode mode); As you can see I removed ReadBufferMode from most of the functions (as suggested by Heikki earlier). When extending by 1/multiple pages, we only need to know whether to lock or not. The reason ExtendBufferedRelTo() has a 'mode' argument is that that allows to fall back to reading page normally if there was a concurrent relation extension. The reason EB_CREATE_FORK_IF_NEEDED exists is to remove the duplicated, gnarly, code to do so from vm_extend(), fsm_extend(). I'm not sure about the function naming pattern. I do like 'By' a lot more than the Bulk prefix I used before. What do you think? Greetings, Andres Freund
On 27/02/2023 23:45, Andres Freund wrote: > On 2023-02-27 18:06:22 +0200, Heikki Linnakangas wrote: >> We also do this in freespace.c and visibilitymap.c: >> >> /* Extend as needed. */ >> while (fsm_nblocks_now < fsm_nblocks) >> { >> PageSetChecksumInplace((Page) pg.data, fsm_nblocks_now); >> >> smgrextend(reln, FSM_FORKNUM, fsm_nblocks_now, >> pg.data, false); >> fsm_nblocks_now++; >> } >> >> We could use the new smgrzeroextend function here. But it would be better to >> go through the buffer cache, because after this, the last block, at >> 'fsm_nblocks', will be read with ReadBuffer() and modified. > > I doubt it's a particularly crucial thing to optimize. Yeah, it won't make any practical difference to performance. I'm more thinking if we can make this more consistent with other places where we extend a relation. > But, uh, isn't this code racy? Because this doesn't go through shared buffers, > there's no IO_IN_PROGRESS interlocking against a concurrent reader. We know > that writing pages isn't atomic vs readers. So another connection could > connection could see the new relation size, but a read might return a > partially written state of the page. Which then would cause checksum > failures. And even worse, I think it could lead to loosing a write, if the > concurrent connection writes out a page. fsm_readbuf and vm_readbuf check the relation size first, with smgrnblocks(), before trying to read the page. So to have a problem, the smgrnblocks() would have to already return the new size, but the smgrread() would not return the new contents. I don't think that's possible, but not sure. >> We could use BulkExtendSharedRelationBuffered() to extend the relation and >> keep the last page locked, but the BulkExtendSharedRelationBuffered() >> signature doesn't allow that. It can return the first N pages locked, but >> there's no way to return the *last* page locked. 
> > We can't rely on bulk extending a, potentially, large number of pages in one > go anyway (since we might not be allowed to pin that many pages). So I don't > think requiring locking the last page is a really viable API. > > I think for this case I'd just just use the ExtendRelationTo() API we were > discussing nearby. Compared to the cost of reducing syscalls / filesystem > overhead to extend the relation, the cost of the buffer mapping lookup does't > seem significant. That's different in e.g. the hio.c case, because there we > need a buffer with free space, and concurrent activity could otherwise fill up > the buffer before we can lock it again. Works for me. - Heikki
Hi, On 2023-03-01 11:12:35 +0200, Heikki Linnakangas wrote: > On 27/02/2023 23:45, Andres Freund wrote: > > But, uh, isn't this code racy? Because this doesn't go through shared buffers, > > there's no IO_IN_PROGRESS interlocking against a concurrent reader. We know > > that writing pages isn't atomic vs readers. So another connection could > > connection could see the new relation size, but a read might return a > > partially written state of the page. Which then would cause checksum > > failures. And even worse, I think it could lead to loosing a write, if the > > concurrent connection writes out a page. > > fsm_readbuf and vm_readbuf check the relation size first, with > smgrnblocks(), before trying to read the page. So to have a problem, the > smgrnblocks() would have to already return the new size, but the smgrread() > would not return the new contents. I don't think that's possible, but not > sure. I hacked Thomas' program to test torn reads to ftruncate the file on the write side. It frequently observes a file size that's not the write size (e.g. reading 4k when writing an 8k block). After extending the test to more than one reader, I indeed also see torn reads. So far all the tears have been at a 4k block boundary. However so far it always has been *prior* page contents, not 0s. Greetings, Andres Freund
Hi, On 2023-03-01 09:02:00 -0800, Andres Freund wrote: > On 2023-03-01 11:12:35 +0200, Heikki Linnakangas wrote: > > On 27/02/2023 23:45, Andres Freund wrote: > > > But, uh, isn't this code racy? Because this doesn't go through shared buffers, > > > there's no IO_IN_PROGRESS interlocking against a concurrent reader. We know > > > that writing pages isn't atomic vs readers. So another connection could > > > connection could see the new relation size, but a read might return a > > > partially written state of the page. Which then would cause checksum > > > failures. And even worse, I think it could lead to loosing a write, if the > > > concurrent connection writes out a page. > > > > fsm_readbuf and vm_readbuf check the relation size first, with > > smgrnblocks(), before trying to read the page. So to have a problem, the > > smgrnblocks() would have to already return the new size, but the smgrread() > > would not return the new contents. I don't think that's possible, but not > > sure. > > I hacked Thomas' program to test torn reads to ftruncate the file on the write > side. > > It frequently observes a file size that's not the write size (e.g. reading 4k > when writing an 8k block). > > After extending the test to more than one reader, I indeed also see torn > reads. So far all the tears have been at a 4k block boundary. However so far > it always has been *prior* page contents, not 0s. On tmpfs the failure rate is much higher, and we also end up reading 0s, despite never writing them. I've attached my version of the test program.

ext4: lots of 4k reads with 8k writes, some torn reads at 4k boundaries
xfs: no issues
tmpfs: loads of 4k reads with 8k writes, lots of torn reads reading 0s, some torn reads at 4k boundaries

Greetings, Andres Freund
Attachment
On 01/03/2023 09:33, Andres Freund wrote: > On 2023-02-21 17:33:31 +0100, Alvaro Herrera wrote: >> On 2023-Feb-21, Heikki Linnakangas wrote: >> >>>> +static BlockNumber >>>> +BulkExtendSharedRelationBuffered(Relation rel, >>>> + SMgrRelation smgr, >>>> + bool skip_extension_lock, >>>> + char relpersistence, >>>> + ForkNumber fork, ReadBufferMode mode, >>>> + BufferAccessStrategy strategy, >>>> + uint32 *num_pages, >>>> + uint32 num_locked_pages, >>>> + Buffer *buffers) >>> >>> Ugh, that's a lot of arguments, some are inputs and some are outputs. I >>> don't have any concrete suggestions, but could we simplify this somehow? >>> Needs a comment at least. >> >> Yeah, I noticed this too. I think it would be easy enough to add a new >> struct that can be passed as a pointer, which can be stack-allocated >> by the caller, and which holds the input arguments that are common to >> both functions, as is sensible. > > I played a fair bit with various options. I ended up not using a struct to > pass most options, but instead go for a flags argument. However, I did use a > struct for passing either relation or smgr. > > > typedef enum ExtendBufferedFlags { > EB_SKIP_EXTENSION_LOCK = (1 << 0), > EB_IN_RECOVERY = (1 << 1), > EB_CREATE_FORK_IF_NEEDED = (1 << 2), > EB_LOCK_FIRST = (1 << 3), > > /* internal flags follow */ > EB_RELEASE_PINS = (1 << 4), > } ExtendBufferedFlags; Is EB_IN_RECOVERY always set when RecoveryInProgress()? Is it really needed? What does EB_LOCK_FIRST do? > /* > * To identify the relation - either relation or smgr + relpersistence has to > * be specified. Used via the EB_REL()/EB_SMGR() macros below. This allows us > * to use the same function for both crash recovery and normal operation. 
> */ > typedef struct ExtendBufferedWhat > { > Relation rel; > struct SMgrRelationData *smgr; > char relpersistence; > } ExtendBufferedWhat; > > #define EB_REL(p_rel) ((ExtendBufferedWhat){.rel = p_rel}) > /* requires use of EB_SKIP_EXTENSION_LOCK */ > #define EB_SMGR(p_smgr, p_relpersistence) ((ExtendBufferedWhat){.smgr = p_smgr, .relpersistence = p_relpersistence}) Clever. I'm still not 100% convinced we need the EB_SMGR variant, but with this we'll have the flexibility in any case. > extern Buffer ExtendBufferedRel(ExtendBufferedWhat eb, > ForkNumber forkNum, > BufferAccessStrategy strategy, > uint32 flags); > extern BlockNumber ExtendBufferedRelBy(ExtendBufferedWhat eb, > ForkNumber fork, > BufferAccessStrategy strategy, > uint32 flags, > uint32 extend_by, > Buffer *buffers, > uint32 *extended_by); > extern Buffer ExtendBufferedRelTo(ExtendBufferedWhat eb, > ForkNumber fork, > BufferAccessStrategy strategy, > uint32 flags, > BlockNumber extend_to, > ReadBufferMode mode); > > As you can see I removed ReadBufferMode from most of the functions (as > suggested by Heikki earlier). When extending by 1/multiple pages, we only need > to know whether to lock or not. Ok, that's better. Still complex and a lot of arguments, but I don't have any great suggestions on how to improve it. > The reason ExtendBufferedRelTo() has a 'mode' argument is that that allows to > fall back to reading page normally if there was a concurrent relation > extension. Hmm, I think you'll need another return value, to let the caller know if the relation was extended or not. Or a flag to ereport(ERROR) if the page already exists, for ginbuild() and friends. > The reason EB_CREATE_FORK_IF_NEEDED exists is to remove the duplicated, > gnarly, code to do so from vm_extend(), fsm_extend(). Makes sense. > I'm not sure about the function naming pattern. I do like 'By' a lot more than > the Bulk prefix I used before. +1 - Heikki
Hi, On 2023-03-02 00:04:14 +0200, Heikki Linnakangas wrote: > On 01/03/2023 09:33, Andres Freund wrote: > > typedef enum ExtendBufferedFlags { > > EB_SKIP_EXTENSION_LOCK = (1 << 0), > > EB_IN_RECOVERY = (1 << 1), > > EB_CREATE_FORK_IF_NEEDED = (1 << 2), > > EB_LOCK_FIRST = (1 << 3), > > > > /* internal flags follow */ > > EB_RELEASE_PINS = (1 << 4), > > } ExtendBufferedFlags; > > Is EB_IN_RECOVERY always set when RecoveryInProgress()? Is it really needed? Right now it's just passed in from the caller. It's at the moment just needed to know what to pass to smgrcreate(isRedo). However, XLogReadBufferExtended() doesn't currently use this path, so maybe it's not actually needed. > What does EB_LOCK_FIRST do? Lock the first returned buffer, this is basically the replacement for num_locked_buffers from the earlier version. I think it's likely that either locking the first, or potentially at some later point locking all buffers, is all that's needed for ExtendBufferedRelBy(). EB_LOCK_FIRST_BUFFER would perhaps be better? > > /* > > * To identify the relation - either relation or smgr + relpersistence has to > > * be specified. Used via the EB_REL()/EB_SMGR() macros below. This allows us > > * to use the same function for both crash recovery and normal operation. > > */ > > typedef struct ExtendBufferedWhat > > { > > Relation rel; > > struct SMgrRelationData *smgr; > > char relpersistence; > > } ExtendBufferedWhat; > > > > #define EB_REL(p_rel) ((ExtendBufferedWhat){.rel = p_rel}) > > /* requires use of EB_SKIP_EXTENSION_LOCK */ > > #define EB_SMGR(p_smgr, p_relpersistence) ((ExtendBufferedWhat){.smgr = p_smgr, .relpersistence = p_relpersistence}) > > Clever. I'm still not 100% convinced we need the EB_SMGR variant, but with > this we'll have the flexibility in any case. Hm - how would you use it from XLogReadBufferExtended() without that? XLogReadBufferExtended() spends a disappointing amount of time in smgropen(). Quite visible in profiles. 
In the plain read case there's at least one in XLogReadBufferExtended() itself, then another in ReadBufferWithoutRelcache(). The extension path right now is worse - it does one smgropen() for each extended block. I think we should avoid using ReadBufferWithoutRelcache() in XLogReadBufferExtended() in the read path as well, but that's for later. > > extern Buffer ExtendBufferedRel(ExtendBufferedWhat eb, > > ForkNumber forkNum, > > BufferAccessStrategy strategy, > > uint32 flags); > > extern BlockNumber ExtendBufferedRelBy(ExtendBufferedWhat eb, > > ForkNumber fork, > > BufferAccessStrategy strategy, > > uint32 flags, > > uint32 extend_by, > > Buffer *buffers, > > uint32 *extended_by); > > extern Buffer ExtendBufferedRelTo(ExtendBufferedWhat eb, > > ForkNumber fork, > > BufferAccessStrategy strategy, > > uint32 flags, > > BlockNumber extend_to, > > ReadBufferMode mode); > > > > As you can see I removed ReadBufferMode from most of the functions (as > > suggested by Heikki earlier). When extending by 1/multiple pages, we only need > > to know whether to lock or not. > > Ok, that's better. Still complex and a lot of arguments, but I don't have > any great suggestions on how to improve it. I don't think there are going to be all that many callers of ExtendBufferedRelBy() and ExtendBufferedRelTo(), most are going to be ExtendBufferedRel(), I think. So the complexity seems acceptable. > > The reason ExtendBufferedRelTo() has a 'mode' argument is that that allows to > > fall back to reading page normally if there was a concurrent relation > > extension. > > Hmm, I think you'll need another return value, to let the caller know if the > relation was extended or not. Or a flag to ereport(ERROR) if the page > already exists, for ginbuild() and friends. I don't think ginbuild() et al need to use ExtendBufferedRelTo()? A plain ExtendBufferedRel() should suffice. The places that do need it are fsm_extend() and vm_extend() - I did end up avoiding the redundant lookup. 
But I was wondering about a flag controlling this as well. Attached is my current version of this. Still needs more polishing, including comments explaining the flags. But I thought it'd be useful to have it out there. There are two new patches in the series:
- a patch to not initialize pages in the loop in fsm_extend(), vm_extend() anymore - we have a check about initializing pages at a later point, so there doesn't really seem to be a need for it?
- a patch to use the new ExtendBufferedRelTo() in fsm_extend(), vm_extend() and XLogReadBufferExtended()
In this version I also tried to address some of the other feedback raised in the thread. One thing I haven't decided what to do about yet is David's feedback about a version of PinLocalBuffer() that doesn't adjust the usagecount, which wouldn't need to read the buf_state. Greetings, Andres Freund
Attachment
- v4-0001-Revise-pg_pwrite_zeros.patch
- v4-0002-Add-some-error-checking-around-pinning.patch
- v4-0003-hio-Release-extension-lock-before-initializing-pa.patch
- v4-0004-Add-smgrzeroextend-FileZero-FileFallocate.patch
- v4-0005-bufmgr-Add-Pin-UnpinLocalBuffer.patch
- v4-0006-bufmgr-Remove-buffer-write-dirty-tracepoints.patch
- v4-0007-bufmgr-Acquire-and-clean-victim-buffer-separately.patch
- v4-0008-bufmgr-Support-multiple-in-progress-IOs-by-using-.patch
- v4-0009-bufmgr-Move-relation-extension-handling-into-Exte.patch
- v4-0010-Convert-a-few-places-to-ExtendBufferedRel.patch
- v4-0011-heapam-Add-num_pages-to-RelationGetBufferForTuple.patch
- v4-0012-hio-Use-ExtendBufferedRelBy.patch
- v4-0013-bufmgr-debug-Add-PrintBuffer-Desc.patch
- v4-0014-WIP-Don-t-initialize-page-in-vm-fsm-_extend-not-n.patch
- v4-0015-Convert-a-few-places-to-ExtendBufferedRelTo.patch
Hi, On 2022-10-28 19:54:20 -0700, Andres Freund wrote:
> I've done a fair bit of benchmarking of this patchset. For COPY it comes out ahead everywhere. It's possible that there's a very small regression for extremely IO miss heavy workloads, more below.
>
> server "base" configuration:
>
> max_wal_size=150GB
> shared_buffers=24GB
> huge_pages=on
> autovacuum=0
> backend_flush_after=2MB
> max_connections=5000
> wal_buffers=128MB
> wal_segment_size=1GB
>
> benchmark: pgbench running COPY into a single table. pgbench -t is set according to the client count, so that the same amount of data is inserted. This is done both using small files ([1], ringbuffer not effective, no dirty data to write out within the benchmark window) and a bit larger files ([2], lots of data to write out due to ringbuffer).
>
> To make it a fair comparison HEAD includes the lwlock-waitqueue fix as well.
>
> s_b=24GB
>
> test: unlogged_small_files, format: text, files: 1024, 9015MB total
>
>              HEAD               patch              no_fsm
> clients  seconds  tbl-MBs   seconds  tbl-MBs   seconds  tbl-MBs
> 1          58.63      207     50.22      242     54.35      224
> 2          32.67      372     25.82      472     27.30      446
> 4          22.53      540     13.30      916     14.33      851
> 8          15.14      804      7.43     1640      7.48     1632
> 16         14.69      829      4.79     2544      4.50     2718
> 32         15.28      797      4.41     2763      3.32     3710
> 64         15.34      794      5.22     2334      3.06     4061
> 128        15.49      786      4.97     2452      3.13     3926
> 256        15.85      768      5.02     2427      3.26     3769
> 512        16.02      760      5.29     2303      3.54     3471

I just spent a few hours trying to reproduce these benchmark results. For the longest time I could not get the numbers for *HEAD* to even get close to the above, while the numbers for the patch were very close. I was worried it was a performance regression in HEAD etc. But no, the same git commit as back then produces the same issue. As it turns out, I somehow screwed up my benchmark tooling, and I did not set the CPU "scaling_governor" and "energy_performance_preference" to "performance". 
In a crazy turn of events, that makes approximately no difference with the patch applied, and a 2x difference for HEAD. I suspect this is some pathological issue around heavy lock contention (likely the CPU dropping into a deeper power-saving state while waiting, which then takes longer to come back out of when the lock is released). As the lock contention is drastically reduced with the patch, that effect is not visible anymore. After fixing the performance scaling issue, the results are quite close to the above numbers again... Aargh, I want my afternoon back. Greetings, Andres Freund
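For anyone reproducing such benchmarks, the two knobs mentioned above live under sysfs on Linux; a sketch of setting them (paths as on typical intel_pstate systems, requires root):

```shell
# Set all cores to the performance governor
for g in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do
    echo performance > "$g"
done

# On intel_pstate systems, also bias the hardware energy/performance hint
for p in /sys/devices/system/cpu/cpu*/cpufreq/energy_performance_preference; do
    echo performance > "$p"
done
```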
Hi, Attached is v5. Lots of comment polishing, a bit of renaming. I extracted the relation extension related code in hio.c back into its own function. While reviewing the hio.c code, I did realize that too much stuff is done while holding the buffer lock. See also the pre-existing issue https://postgr.es/m/20230325025740.wzvchp2kromw4zqz%40awork3.anarazel.de Greetings, Andres Freund
Attachment
- v5-0001-Add-some-error-checking-around-pinning.patch
- v5-0002-hio-Release-extension-lock-before-initializing-pa.patch
- v5-0003-Add-smgrzeroextend-FileZero-FileFallocate.patch
- v5-0004-bufmgr-Add-Pin-UnpinLocalBuffer.patch
- v5-0005-bufmgr-Remove-buffer-write-dirty-tracepoints.patch
- v5-0006-bufmgr-Acquire-and-clean-victim-buffer-separately.patch
- v5-0007-bufmgr-Support-multiple-in-progress-IOs-by-using-.patch
- v5-0008-bufmgr-Move-relation-extension-handling-into-Exte.patch
- v5-0009-Convert-a-few-places-to-ExtendBufferedRel.patch
- v5-0010-heapam-Add-num_pages-to-RelationGetBufferForTuple.patch
- v5-0011-hio-Use-ExtendBufferedRelBy.patch
- v5-0012-WIP-Don-t-initialize-page-in-vm-fsm-_extend-not-n.patch
- v5-0013-Convert-a-few-places-to-ExtendBufferedRelTo.patch
Hi, Below is my review of a slightly older version than you just posted -- much of it you may have already addressed. From 3a6c3f41000e057bae12ab4431e6bb1c5f3ec4b0 Mon Sep 17 00:00:00 2001 From: Andres Freund <andres@anarazel.de> Date: Mon, 20 Mar 2023 21:57:40 -0700 Subject: [PATCH v5 01/15] createdb-using-wal-fixes This could use a more detailed commit message -- I don't really get what it is doing From 6faba69c241fd5513022bb042c33af09d91e84a6 Mon Sep 17 00:00:00 2001 From: Andres Freund <andres@anarazel.de> Date: Wed, 1 Jul 2020 19:06:45 -0700 Subject: [PATCH v5 02/15] Add some error checking around pinning --- src/backend/storage/buffer/bufmgr.c | 40 ++++++++++++++++++++--------- src/include/storage/bufmgr.h | 1 + 2 files changed, 29 insertions(+), 12 deletions(-) diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c index 95212a3941..fa20fab5a2 100644 --- a/src/backend/storage/buffer/bufmgr.c +++ b/src/backend/storage/buffer/bufmgr.c @@ -4283,6 +4287,25 @@ ConditionalLockBuffer(Buffer buffer) LW_EXCLUSIVE); } +void +BufferCheckOneLocalPin(Buffer buffer) +{ + if (BufferIsLocal(buffer)) + { + /* There should be exactly one pin */ + if (LocalRefCount[-buffer - 1] != 1) + elog(ERROR, "incorrect local pin count: %d", + LocalRefCount[-buffer - 1]); + } + else + { + /* There should be exactly one local pin */ + if (GetPrivateRefCount(buffer) != 1) I'd rather this be an else if (was already like this, but, no reason not to change it now) + elog(ERROR, "incorrect local pin count: %d", + GetPrivateRefCount(buffer)); + } +} From 00d3044770478eea31e00fee8d1680f22ca6adde Mon Sep 17 00:00:00 2001 From: Andres Freund <andres@anarazel.de> Date: Mon, 27 Feb 2023 17:36:37 -0800 Subject: [PATCH v5 04/15] Add smgrzeroextend(), FileZero(), FileFallocate() diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c index 9fd8444ed4..c34ed41d52 100644 --- a/src/backend/storage/file/fd.c +++ b/src/backend/storage/file/fd.c @@ -2206,6 
+2206,92 @@ FileSync(File file, uint32 wait_event_info) return returnCode; } +/* + * Zero a region of the file. + * + * Returns 0 on success, -1 otherwise. In the latter case errno is set to the + * appropriate error. + */ +int +FileZero(File file, off_t offset, off_t amount, uint32 wait_event_info) +{ + int returnCode; + ssize_t written; + + Assert(FileIsValid(file)); + returnCode = FileAccess(file); + if (returnCode < 0) + return returnCode; + + pgstat_report_wait_start(wait_event_info); + written = pg_pwrite_zeros(VfdCache[file].fd, amount, offset); + pgstat_report_wait_end(); + + if (written < 0) + return -1; + else if (written != amount) this doesn't need to be an else if + { + /* if errno is unset, assume problem is no disk space */ + if (errno == 0) + errno = ENOSPC; + return -1; + } +int +FileFallocate(File file, off_t offset, off_t amount, uint32 wait_event_info) +{ +#ifdef HAVE_POSIX_FALLOCATE + int returnCode; + + Assert(FileIsValid(file)); + returnCode = FileAccess(file); + if (returnCode < 0) + return returnCode; + + pgstat_report_wait_start(wait_event_info); + returnCode = posix_fallocate(VfdCache[file].fd, offset, amount); + pgstat_report_wait_end(); + + if (returnCode == 0) + return 0; + + /* for compatibility with %m printing etc */ + errno = returnCode; + + /* + * Return in cases of a "real" failure, if fallocate is not supported, + * fall through to the FileZero() backed implementation. 
+ */ + if (returnCode != EINVAL && returnCode != EOPNOTSUPP) + return returnCode; I'm pretty sure you can just delete the below if statement + if (returnCode == 0 || + (returnCode != EINVAL && returnCode != EINVAL)) + return returnCode; diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c index 352958e1fe..59a65a8305 100644 --- a/src/backend/storage/smgr/md.c +++ b/src/backend/storage/smgr/md.c @@ -28,6 +28,7 @@ #include "access/xlog.h" #include "access/xlogutils.h" #include "commands/tablespace.h" +#include "common/file_utils.h" #include "miscadmin.h" #include "pg_trace.h" #include "pgstat.h" @@ -500,6 +501,116 @@ mdextend(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum, Assert(_mdnblocks(reln, forknum, v) <= ((BlockNumber) RELSEG_SIZE)); } +/* + * mdzeroextend() -- Add ew zeroed out blocks to the specified relation. not sure what ew is + * + * Similar to mdrextend(), except the relation can be extended by mdrextend->mdextend + * multiple blocks at once, and that the added blocks will be filled with I would lose the comma and just say "and the added blocks will be filled..." +void +mdzeroextend(SMgrRelation reln, ForkNumber forknum, + BlockNumber blocknum, int nblocks, bool skipFsync) So, I think there are a few too many local variables in here, and it actually makes it more confusing. Assuming you would like to keep the input parameters blocknum and nblocks unmodified for debugging/other reasons, here is a suggested refactor of this function Also, I think you can combine the two error cases (I don't know if the user cares what you were trying to extend the file with). I've done this below also. void mdzeroextend(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum, int nblocks, bool skipFsync) { MdfdVec *v; BlockNumber curblocknum = blocknum; int remblocks = nblocks; Assert(nblocks > 0); /* This assert is too expensive to have on normally ... 
*/ #ifdef CHECK_WRITE_VS_EXTEND Assert(blocknum >= mdnblocks(reln, forknum)); #endif /* * If a relation manages to grow to 2^32-1 blocks, refuse to extend it any * more --- we mustn't create a block whose number actually is * InvalidBlockNumber or larger. */ if ((uint64) blocknum + nblocks >= (uint64) InvalidBlockNumber) ereport(ERROR, (errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED), errmsg("cannot extend file \"%s\" beyond %u blocks", relpath(reln->smgr_rlocator, forknum), InvalidBlockNumber))); while (remblocks > 0) { int segstartblock = curblocknum % ((BlockNumber) RELSEG_SIZE); int numblocks = remblocks; off_t seekpos = (off_t) BLCKSZ * segstartblock; int ret; if (segstartblock + remblocks > RELSEG_SIZE) numblocks = RELSEG_SIZE - segstartblock; v = _mdfd_getseg(reln, forknum, curblocknum, skipFsync, EXTENSION_CREATE); /* * If available and useful, use posix_fallocate() (via FileAllocate()) * to extend the relation. That's often more efficient than using * write(), as it commonly won't cause the kernel to allocate page * cache space for the extended pages. * * However, we don't use FileAllocate() for small extensions, as it * defeats delayed allocation on some filesystems. Not clear where * that decision should be made though? For now just use a cutoff of * 8, anything between 4 and 8 worked OK in some local testing. */ if (numblocks > 8) ret = FileFallocate(v->mdfd_vfd, seekpos, (off_t) BLCKSZ * numblocks, WAIT_EVENT_DATA_FILE_EXTEND); else /* * Even if we don't want to use fallocate, we can still extend a * bit more efficiently than writing each 8kB block individually. * pg_pwrite_zeroes() (via FileZero()) uses * pg_pwritev_with_retry() to avoid multiple writes or needing a * zeroed buffer for the whole length of the extension. 
*/ ret = FileZero(v->mdfd_vfd, seekpos, (off_t) BLCKSZ * numblocks, WAIT_EVENT_DATA_FILE_EXTEND); if (ret != 0) ereport(ERROR, errcode_for_file_access(), errmsg("could not extend file \"%s\": %m", FilePathName(v->mdfd_vfd)), errhint("Check free disk space.")); if (!skipFsync && !SmgrIsTemp(reln)) register_dirty_segment(reln, forknum, v); Assert(_mdnblocks(reln, forknum, v) <= ((BlockNumber) RELSEG_SIZE)); remblocks -= numblocks; curblocknum += numblocks; } } diff --git a/src/backend/storage/smgr/smgr.c b/src/backend/storage/smgr/smgr.c index dc466e5414..5224ca5259 100644 --- a/src/backend/storage/smgr/smgr.c +++ b/src/backend/storage/smgr/smgr.c @@ -50,6 +50,8 @@ typedef struct f_smgr +/* + * smgrzeroextend() -- Add new zeroed out blocks to a file. + * + * Similar to smgrextend(), except the relation can be extended by + * multiple blocks at once, and that the added blocks will be filled with + * zeroes. + */ Similar grammatical feedback as mdzeroextend. From ad7cd10a6c340d7f7d0adf26d5e39224dfd8439d Mon Sep 17 00:00:00 2001 From: Andres Freund <andres@anarazel.de> Date: Wed, 26 Oct 2022 12:05:07 -0700 Subject: [PATCH v5 05/15] bufmgr: Add Pin/UnpinLocalBuffer() diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c index fa20fab5a2..6f50dbd212 100644 --- a/src/backend/storage/buffer/bufmgr.c +++ b/src/backend/storage/buffer/bufmgr.c @@ -4288,18 +4268,16 @@ ConditionalLockBuffer(Buffer buffer) } void -BufferCheckOneLocalPin(Buffer buffer) +BufferCheckWePinOnce(Buffer buffer) This name is weird. Who is we? 
diff --git a/src/backend/storage/buffer/localbuf.c b/src/backend/storage/buffer/localbuf.c index 5325ddb663..798c5b93a8 100644 --- a/src/backend/storage/buffer/localbuf.c +++ b/src/backend/storage/buffer/localbuf.c +bool +PinLocalBuffer(BufferDesc *buf_hdr, bool adjust_usagecount) +{ + uint32 buf_state; + Buffer buffer = BufferDescriptorGetBuffer(buf_hdr); + int bufid = -(buffer + 1); You do int buffid = -buffer - 1; in UnpinLocalBuffer() They should be consistent. int bufid = -(buffer + 1); I think this version is better: int buffid = -buffer - 1; Since if buffer is INT_MAX, then the -(buffer + 1) version invokes undefined behavior while the -buffer - 1 version doesn't. From a0228218e2ac299aac754eeb5b2be7ddfc56918d Mon Sep 17 00:00:00 2001 From: Andres Freund <andres@anarazel.de> Date: Fri, 17 Feb 2023 18:26:34 -0800 Subject: [PATCH v5 07/15] bufmgr: Acquire and clean victim buffer separately Previously we held buffer locks for two buffer mapping partitions at the same time to change the identity of buffers. Particularly for extending relations needing to hold the extension lock while acquiring a victim buffer is painful. By separating out the victim buffer acquisition, future commits will be able to change relation extensions to scale better. diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c index 3d0683593f..ea423ae484 100644 --- a/src/backend/storage/buffer/bufmgr.c +++ b/src/backend/storage/buffer/bufmgr.c @@ -1200,293 +1200,111 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum, /* * Buffer contents are currently invalid. Try to obtain the right to * start I/O. If StartBufferIO returns false, then someone else managed * to read it before we did, so there's nothing left for BufferAlloc() to * do. 
*/ - if (StartBufferIO(buf, true)) + if (StartBufferIO(victim_buf_hdr, true)) *foundPtr = false; else *foundPtr = true; I know it was already like this, but since you edited the line already, can we just make this this now? *foundPtr = !StartBufferIO(victim_buf_hdr, true); @@ -1595,6 +1413,237 @@ retry: StrategyFreeBuffer(buf); } +/* + * Helper routine for GetVictimBuffer() + * + * Needs to be called on a buffer with a valid tag, pinned, but without the + * buffer header spinlock held. + * + * Returns true if the buffer can be reused, in which case the buffer is only + * pinned by this backend and marked as invalid, false otherwise. + */ +static bool +InvalidateVictimBuffer(BufferDesc *buf_hdr) +{ + /* + * Clear out the buffer's tag and flags and usagecount. This is not + * strictly required, as BM_TAG_VALID/BM_VALID needs to be checked before + * doing anything with the buffer. But currently it's beneficial as the + * pre-check for several linear scans of shared buffers just checks the + * tag. I don't really understand the above comment -- mainly the last sentence. +static Buffer +GetVictimBuffer(BufferAccessStrategy strategy, IOContext io_context) +{ + BufferDesc *buf_hdr; + Buffer buf; + uint32 buf_state; + bool from_ring; + + /* + * Ensure, while the spinlock's not yet held, that there's a free refcount + * entry. + */ + ReservePrivateRefCountEntry(); + ResourceOwnerEnlargeBuffers(CurrentResourceOwner); + + /* we return here if a prospective victim buffer gets used concurrently */ +again: Why use goto instead of a loop here (again is the goto label)? From a7597b79dffaf96807f4a9beea0a39634530298d Mon Sep 17 00:00:00 2001 From: Andres Freund <andres@anarazel.de> Date: Mon, 24 Oct 2022 16:44:16 -0700 Subject: [PATCH v5 08/15] bufmgr: Support multiple in-progress IOs by using resowner Commit message should describe why we couldn't support multiple in-progress IOs before, I think (e.g. we couldn't be sure that we cleared IO_IN_PROGRESS if something happened). 
@@ -4709,8 +4704,6 @@ TerminateBufferIO(BufferDesc *buf, bool clear_dirty, uint32 set_flag_bits)
 {
 	uint32		buf_state;

I noticed that the comment above TerminateBufferIO() says

 * TerminateBufferIO: release a buffer we were doing I/O on
 *	(Assumptions)
 *	My process is executing IO for the buffer

Can we still say this is an assumption? What about when it is being cleaned
up after being called from AbortBufferIO()?

diff --git a/src/backend/utils/resowner/resowner.c b/src/backend/utils/resowner/resowner.c
index 19b6241e45..fccc59b39d 100644
--- a/src/backend/utils/resowner/resowner.c
+++ b/src/backend/utils/resowner/resowner.c
@@ -121,6 +121,7 @@ typedef struct ResourceOwnerData
 	/* We have built-in support for remembering: */
 	ResourceArray bufferarr;	/* owned buffers */
+	ResourceArray bufferioarr;	/* in-progress buffer IO */
 	ResourceArray catrefarr;	/* catcache references */
 	ResourceArray catlistrefarr;	/* catcache-list pins */
 	ResourceArray relrefarr;	/* relcache references */
@@ -441,6 +442,7 @@ ResourceOwnerCreate(ResourceOwner parent, const char *name)

Maybe worth mentioning in-progress buffer IO in the resowner README? I know it
doesn't claim to be exhaustive, so, up to you.

Also, I realize that existing code in this file has the extraneous
parentheses, but maybe it isn't worth staying consistent with that?
as in: &(owner->bufferioarr) + */ +void +ResourceOwnerRememberBufferIO(ResourceOwner owner, Buffer buffer) +{ + ResourceArrayAdd(&(owner->bufferioarr), BufferGetDatum(buffer)); +} + From f26d1fa7e528d04436402aa8f94dc2442999dde3 Mon Sep 17 00:00:00 2001 From: Andres Freund <andres@anarazel.de> Date: Wed, 1 Mar 2023 13:24:19 -0800 Subject: [PATCH v5 09/15] bufmgr: Move relation extension handling into ExtendBufferedRel{By,To,} diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c index 3c95b87bca..4e07a5bc48 100644 --- a/src/backend/storage/buffer/bufmgr.c +++ b/src/backend/storage/buffer/bufmgr.c +/* + * Extend relation by multiple blocks. + * + * Tries to extend the relation by extend_by blocks. Depending on the + * availability of resources the relation may end up being extended by a + * smaller number of pages (unless an error is thrown, always by at least one + * page). *extended_by is updated to the number of pages the relation has been + * extended to. + * + * buffers needs to be an array that is at least extend_by long. Upon + * completion, the first extend_by array elements will point to a pinned + * buffer. + * + * If EB_LOCK_FIRST is part of flags, the first returned buffer is + * locked. This is useful for callers that want a buffer that is guaranteed to + * be empty. This should document what the returned BlockNumber is. Also, instead of having extend_by and extended_by, how about just having one which is set by the caller to the desired number to extend by and then overwritten in this function to the value it successfully extended by. It would be nice if the function returned the number it extended by instead of the BlockNumber. 
+ */ +BlockNumber +ExtendBufferedRelBy(ExtendBufferedWhat eb, + ForkNumber fork, + BufferAccessStrategy strategy, + uint32 flags, + uint32 extend_by, + Buffer *buffers, + uint32 *extended_by) +{ + Assert((eb.rel != NULL) ^ (eb.smgr != NULL)); Can we turn these into != Assert((eb.rel != NULL) != (eb.smgr != NULL)); since it is easier to understand. (it is also in ExtendBufferedRelTo()) + Assert(eb.smgr == NULL || eb.relpersistence != 0); + Assert(extend_by > 0); + + if (eb.smgr == NULL) + { + eb.smgr = RelationGetSmgr(eb.rel); + eb.relpersistence = eb.rel->rd_rel->relpersistence; + } + + return ExtendBufferedRelCommon(eb, fork, strategy, flags, + extend_by, InvalidBlockNumber, + buffers, extended_by); +} + * Extend the relation so it is at least extend_to blocks large, read buffer Use of "read buffer" here is confusing. We only read the block if, after we try extending the relation, someone else already did so and we have to read the block they extended in, right? + * (extend_to - 1). + * + * This is useful for callers that want to write a specific page, regardless + * of the current size of the relation (e.g. useful for visibilitymap and for + * crash recovery). + */ +Buffer +ExtendBufferedRelTo(ExtendBufferedWhat eb, + ForkNumber fork, + BufferAccessStrategy strategy, + uint32 flags, + BlockNumber extend_to, + ReadBufferMode mode) +{ + while (current_size < extend_to) + { Can declare buffers variable here. 
+ Buffer buffers[64]; + uint32 num_pages = lengthof(buffers); + BlockNumber first_block; + + if ((uint64) current_size + num_pages > extend_to) + num_pages = extend_to - current_size; + + first_block = ExtendBufferedRelCommon(eb, fork, strategy, flags, + num_pages, extend_to, + buffers, &extended_by); + + current_size = first_block + extended_by; + Assert(current_size <= extend_to); + Assert(num_pages != 0 || current_size >= extend_to); + + for (int i = 0; i < extended_by; i++) + { + if (first_block + i != extend_to - 1) Is there a way we could avoid pinning these other buffers to begin with (e.g. passing a parameter to ExtendBufferedRelCommon()) + ReleaseBuffer(buffers[i]); + else + buffer = buffers[i]; + } + } + /* + * It's possible that another backend concurrently extended the + * relation. In that case read the buffer. + * + * XXX: Should we control this via a flag? + */ I feel like there needs to be a more explicit comment about how you could end up in this situation -- e.g. someone else extends the relation and so smgrnblocks returns a value that is greater than extend_to, so buffer stays InvalidBuffer + if (buffer == InvalidBuffer) + { + bool hit; + + Assert(extended_by == 0); + buffer = ReadBuffer_common(eb.smgr, eb.relpersistence, + fork, extend_to - 1, mode, strategy, + &hit); + } + + return buffer; +} Do we use compound literals? Here, this could be: buffer = ReadBuffer_common(eb.smgr, eb.relpersistence, fork, extend_to - 1, mode, strategy, &(bool) {0}); To eliminate the extraneous hit variable. 
/* * ReadBuffer_common -- common logic for all ReadBuffer variants @@ -801,35 +991,36 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum, bool found; IOContext io_context; IOObject io_object; - bool isExtend; bool isLocalBuf = SmgrIsTemp(smgr); *hit = false; + /* + * Backward compatibility path, most code should use + * ExtendRelationBuffered() instead, as acquiring the extension lock + * inside ExtendRelationBuffered() scales a lot better. Think these are old function names in the comment +static BlockNumber +ExtendBufferedRelShared(ExtendBufferedWhat eb, + ForkNumber fork, + BufferAccessStrategy strategy, + uint32 flags, + uint32 extend_by, + BlockNumber extend_upto, + Buffer *buffers, + uint32 *extended_by) +{ + BlockNumber first_block; + IOContext io_context = IOContextForStrategy(strategy); + + LimitAdditionalPins(&extend_by); + + /* + * Acquire victim buffers for extension without holding extension lock. + * Writing out victim buffers is the most expensive part of extending the + * relation, particularly when doing so requires WAL flushes. Zeroing out + * the buffers is also quite expensive, so do that before holding the + * extension lock as well. + * + * These pages are pinned by us and not valid. While we hold the pin they + * can't be acquired as victim buffers by another backend. + */ + for (uint32 i = 0; i < extend_by; i++) + { + Block buf_block; + + buffers[i] = GetVictimBuffer(strategy, io_context); + buf_block = BufHdrGetBlock(GetBufferDescriptor(buffers[i] - 1)); + + /* new buffers are zero-filled */ + MemSet((char *) buf_block, 0, BLCKSZ); + } + + /* + * Lock relation against concurrent extensions, unless requested not to. + * + * We use the same extension lock for all forks. That's unnecessarily + * restrictive, but currently extensions for forks don't happen often + * enough to make it worth locking more granularly. + * + * Note that another backend might have extended the relation by the time + * we get the lock. 
+ */ + if (!(flags & EB_SKIP_EXTENSION_LOCK)) + { + LockRelationForExtension(eb.rel, ExclusiveLock); + eb.smgr = RelationGetSmgr(eb.rel); + } + + /* + * If requested, invalidate size cache, so that smgrnblocks asks the + * kernel. + */ + if (flags & EB_CLEAR_SIZE_CACHE) + eb.smgr->smgr_cached_nblocks[fork] = InvalidBlockNumber; I don't see this in master, is it new? + first_block = smgrnblocks(eb.smgr, fork); + The below needs a better comment explaining what it is handling. e.g. if we end up extending by less than we planned, unpin all of the surplus victim buffers. + if (extend_upto != InvalidBlockNumber) + { + uint32 old_num_pages = extend_by; maybe call this something like original_extend_by diff --git a/src/backend/storage/buffer/localbuf.c b/src/backend/storage/buffer/localbuf.c index 5b44b0be8b..0528fddf99 100644 --- a/src/backend/storage/buffer/localbuf.c +++ b/src/backend/storage/buffer/localbuf.c +BlockNumber +ExtendBufferedRelLocal(ExtendBufferedWhat eb, + ForkNumber fork, + uint32 flags, + uint32 extend_by, + BlockNumber extend_upto, + Buffer *buffers, + uint32 *extended_by) +{ + victim_buf_id = -(buffers[i] + 1); same comment here as before. + * Flags influencing the behaviour of ExtendBufferedRel* + */ +typedef enum ExtendBufferedFlags +{ + /* + * Don't acquire extension lock. This is safe only if the relation isn't + * shared, an access exclusive lock is held or if this is the startup + * process. + */ + EB_SKIP_EXTENSION_LOCK = (1 << 0), + + /* Is this extension part of recovery? */ + EB_PERFORMING_RECOVERY = (1 << 1), + + /* + * Should the fork be created if it does not currently exist? This likely + * only ever makes sense for relation forks. + */ + EB_CREATE_FORK_IF_NEEDED = (1 << 2), + + /* Should the first (possibly only) return buffer be returned locked? */ + EB_LOCK_FIRST = (1 << 3), + + /* Should the smgr size cache be cleared? 
*/ + EB_CLEAR_SIZE_CACHE = (1 << 4), + + /* internal flags follow */ I don't understand what this comment means ("internal flags follow") + EB_LOCK_TARGET = (1 << 5), +} ExtendBufferedFlags; +typedef struct ExtendBufferedWhat Maybe this should be called like BufferedExtendTarget or something? +{ + Relation rel; + struct SMgrRelationData *smgr; + char relpersistence; +} ExtendBufferedWhat; From e4438c0eb87035e4cefd1de89458a8d88c90c0e3 Mon Sep 17 00:00:00 2001 From: Andres Freund <andres@anarazel.de> Date: Sun, 23 Oct 2022 14:44:43 -0700 Subject: [PATCH v5 11/15] heapam: Add num_pages to RelationGetBufferForTuple() This will be useful to compute the number of pages to extend a relation by. diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c index cf4b917eb4..500904897d 100644 --- a/src/backend/access/heap/heapam.c +++ b/src/backend/access/heap/heapam.c @@ -2050,6 +2053,33 @@ heap_prepare_insert(Relation relation, HeapTuple tup, TransactionId xid, return tup; } +/* + * Helper for heap_multi_insert() that computes the number of full pages s no space after page before s + */ +static int +heap_multi_insert_pages(HeapTuple *heaptuples, int done, int ntuples, Size saveFreeSpace) +{ + size_t page_avail; + int npages = 0; + + page_avail = BLCKSZ - SizeOfPageHeaderData - saveFreeSpace; + npages++; can this not just be this: size_t page_avail = BLCKSZ - SizeOfPageHeaderData - saveFreeSpace; int npages = 1; From 5d2be27caf8f4ee8f26841b2aa1674c90bd51754 Mon Sep 17 00:00:00 2001 From: Andres Freund <andres@anarazel.de> Date: Wed, 26 Oct 2022 14:14:11 -0700 Subject: [PATCH v5 12/15] hio: Use ExtendBufferedRelBy() --- src/backend/access/heap/hio.c | 285 +++++++++++++++++----------------- 1 file changed, 146 insertions(+), 139 deletions(-) diff --git a/src/backend/access/heap/hio.c b/src/backend/access/heap/hio.c index 65886839e7..48cfcff975 100644 --- a/src/backend/access/heap/hio.c +++ b/src/backend/access/heap/hio.c @@ -354,6 +270,9 @@ 
RelationGetBufferForTuple(Relation relation, Size len, so in RelationGetBufferForTuple() up above where your changes start, there is this code /* * We first try to put the tuple on the same page we last inserted a tuple * on, as cached in the BulkInsertState or relcache entry. If that * doesn't work, we ask the Free Space Map to locate a suitable page. * Since the FSM's info might be out of date, we have to be prepared to * loop around and retry multiple times. (To insure this isn't an infinite * loop, we must update the FSM with the correct amount of free space on * each page that proves not to be suitable.) If the FSM has no record of * a page with enough free space, we give up and extend the relation. * * When use_fsm is false, we either put the tuple onto the existing target * page or extend the relation. */ if (bistate && bistate->current_buf != InvalidBuffer) { targetBlock = BufferGetBlockNumber(bistate->current_buf); } else targetBlock = RelationGetTargetBlock(relation); if (targetBlock == InvalidBlockNumber && use_fsm) { /* * We have no cached target page, so ask the FSM for an initial * target. */ targetBlock = GetPageWithFreeSpace(relation, targetFreeSpace); } And, I was thinking how, ReadBufferBI() only has one caller now (RelationGetBufferForTuple()) and, this caller basically already has checked for the case in the inside of ReadBufferBI() (the code I pasted above) /* If we have the desired block already pinned, re-pin and return it */ if (bistate->current_buf != InvalidBuffer) { if (BufferGetBlockNumber(bistate->current_buf) == targetBlock) { /* * Currently the LOCK variants are only used for extending * relation, which should never reach this branch. */ Assert(mode != RBM_ZERO_AND_LOCK && mode != RBM_ZERO_AND_CLEANUP_LOCK); IncrBufferRefCount(bistate->current_buf); return bistate->current_buf; } /* ... 
else drop the old buffer */ So, I was thinking maybe there is some way to inline the logic for ReadBufferBI(), because I think it would feel more streamlined to me. @@ -558,18 +477,46 @@ loop: ReleaseBuffer(buffer); } Oh, and I forget which commit introduced BulkInsertState->next_free and last_free, but I remember thinking that it didn't seem to fit with the other parts of that commit. - /* Without FSM, always fall out of the loop and extend */ - if (!use_fsm) - break; + if (bistate + && bistate->next_free != InvalidBlockNumber + && bistate->next_free <= bistate->last_free) + { + /* + * We bulk extended the relation before, and there are still some + * unused pages from that extension, so we don't need to look in + * the FSM for a new page. But do record the free space from the + * last page, somebody might insert narrower tuples later. + */ Why couldn't we have found out that we bulk-extended before and get the block from there up above the while loop? + if (use_fsm) + RecordPageWithFreeSpace(relation, targetBlock, pageFreeSpace); - /* - * Update FSM as to condition of this page, and ask for another page - * to try. - */ - targetBlock = RecordAndGetPageWithFreeSpace(relation, - targetBlock, - pageFreeSpace, - targetFreeSpace); + Assert(bistate->last_free != InvalidBlockNumber && You don't need the below half of the assert. + bistate->next_free <= bistate->last_free); + targetBlock = bistate->next_free; + if (bistate->next_free >= bistate->last_free) they can only be equal at this point + { + bistate->next_free = InvalidBlockNumber; + bistate->last_free = InvalidBlockNumber; + } + else + bistate->next_free++; + } + else if (!use_fsm) + { + /* Without FSM, always fall out of the loop and extend */ + break; + } It would be nice to have a comment explaining why this is in its own else if instead of breaking earlier (i.e. 
!use_fsm is still a valid case in the if branch above it) + else + { + /* + * Update FSM as to condition of this page, and ask for another + * page to try. + */ + targetBlock = RecordAndGetPageWithFreeSpace(relation, + targetBlock, + pageFreeSpace, + targetFreeSpace); + } we can get rid of needLock and waitcount variables like this +#define MAX_BUFFERS 64 + Buffer victim_buffers[MAX_BUFFERS]; + BlockNumber firstBlock = InvalidBlockNumber; + BlockNumber firstBlockFSM = InvalidBlockNumber; + BlockNumber curBlock; + uint32 extend_by_pages; + uint32 no_fsm_pages; + uint32 waitcount; + + extend_by_pages = num_pages; + + /* + * Multiply the number of pages to extend by the number of waiters. Do + * this even if we're not using the FSM, as it does relieve + * contention. Pages will be found via bistate->next_free. + */ + if (needLock) + waitcount = RelationExtensionLockWaiterCount(relation); + else + waitcount = 0; + extend_by_pages += extend_by_pages * waitcount; if (!RELATION_IS_LOCAL(relation)) extend_by_pages += extend_by_pages * RelationExtensionLockWaiterCount(relation); + + /* + * can't extend by more than MAX_BUFFERS, we need to pin them all + * concurrently. FIXME: Need an NBuffers / MaxBackends type limit + * here. + */ + extend_by_pages = Min(extend_by_pages, MAX_BUFFERS); + + /* + * How many of the extended pages not to enter into the FSM. + * + * Only enter pages that we don't need ourselves into the FSM. + * Otherwise every other backend will immediately try to use the pages + * this backend neds itself, causing unnecessary contention. + * + * Bulk extended pages are remembered in bistate->next_free_buffer. So + * without a bistate we can't directly make use of them. + * + * Never enter the page returned into the FSM, we'll immediately use + * it. + */ + if (num_pages > 1 && bistate == NULL) + no_fsm_pages = 1; + else + no_fsm_pages = num_pages; this is more clearly this: no_fsm_pages = bistate == NULL ? 
1 : num_pages; - /* - * Release the file-extension lock; it's now OK for someone else to extend - * the relation some more. - */ - if (needLock) - UnlockRelationForExtension(relation, ExclusiveLock); + if (bistate) + { + if (extend_by_pages > 1) + { + bistate->next_free = firstBlock + 1; + bistate->last_free = firstBlock + extend_by_pages - 1; + } + else + { + bistate->next_free = InvalidBlockNumber; + bistate->last_free = InvalidBlockNumber; + } + } + + buffer = victim_buffers[0]; If we move buffer = up, we can have only one if (bistate) + if (bistate) + { + IncrBufferRefCount(buffer); + bistate->current_buf = buffer; + } + } like this: buffer = victim_buffers[0]; if (bistate) { if (extend_by_pages > 1) { bistate->next_free = firstBlock + 1; bistate->last_free = firstBlock + extend_by_pages - 1; } else { bistate->next_free = InvalidBlockNumber; bistate->last_free = InvalidBlockNumber; } IncrBufferRefCount(buffer); bistate->current_buf = buffer; } From 6711e45bed59ee07ec277b9462f4745603a3d4a4 Mon Sep 17 00:00:00 2001 From: Andres Freund <andres@anarazel.de> Date: Sun, 23 Oct 2022 14:41:46 -0700 Subject: [PATCH v5 15/15] bufmgr: debug: Add PrintBuffer[Desc] Useful for development. Perhaps we should polish these and keep them? diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c index 4e07a5bc48..0d382cd787 100644 --- a/src/backend/storage/buffer/bufmgr.c +++ b/src/backend/storage/buffer/bufmgr.c + + fprintf(stderr, "%d: [%u] msg: %s, rel: %s, block %u: refcount: %u / %u, usagecount: %u, flags:%s%s%s%s%s%s%s%s%s%s\n", + MyProcPid, + buffer, + msg, + path, + blockno, + BUF_STATE_GET_REFCOUNT(buf_state), + GetPrivateRefCount(buffer), + BUF_STATE_GET_USAGECOUNT(buf_state), + buf_state & BM_LOCKED ? " BM_LOCKED" : "", + buf_state & BM_DIRTY ? " BM_DIRTY" : "", + buf_state & BM_VALID ? " BM_VALID" : "", + buf_state & BM_TAG_VALID ? " BM_TAG_VALID" : "", + buf_state & BM_IO_IN_PROGRESS ? 
" BM_IO_IN_PROGRESS" : "", + buf_state & BM_IO_ERROR ? " BM_IO_ERROR" : "", + buf_state & BM_JUST_DIRTIED ? " BM_JUST_DIRTIED" : "", + buf_state & BM_PIN_COUNT_WAITER ? " BM_PIN_COUNT_WAITER" : "", + buf_state & BM_CHECKPOINT_NEEDED ? " BM_CHECKPOINT_NEEDED" : "", + buf_state & BM_PERMANENT ? " BM_PERMANENT" : "" + ); +} How about this #define FLAG_DESC(flag) (buf_state & (flag) ? " " #flag : "") FLAG_DESC(BM_LOCKED), FLAG_DESC(BM_DIRTY), FLAG_DESC(BM_VALID), FLAG_DESC(BM_TAG_VALID), FLAG_DESC(BM_IO_IN_PROGRESS), FLAG_DESC(BM_IO_ERROR), FLAG_DESC(BM_JUST_DIRTIED), FLAG_DESC(BM_PIN_COUNT_WAITER), FLAG_DESC(BM_CHECKPOINT_NEEDED), FLAG_DESC(BM_PERMANENT) #undef FLAG_DESC + +void +PrintBuffer(Buffer buffer, const char *msg) +{ + BufferDesc *buf_hdr = GetBufferDescriptor(buffer - 1); no need for this variable + + PrintBufferDesc(buf_hdr, msg); +} - Melanie
At Sun, 26 Mar 2023 12:26:59 -0700, Andres Freund <andres@anarazel.de> wrote in
> Hi,
>
> Attached is v5. Lots of comment polishing, a bit of renaming. I extracted the
> relation extension related code in hio.c back into its own function.
>
> While reviewing the hio.c code, I did realize that too much stuff is done
> while holding the buffer lock. See also the pre-existing issue
> https://postgr.es/m/20230325025740.wzvchp2kromw4zqz%40awork3.anarazel.de

0001 and 0002 look fine to me.

0003 adds the new function FileFallocate, but we already have AllocateFile.
Although fd.c contains functions with varying word orders, it could be
confusing that closely named functions follow different naming conventions.

+	/*
+	 * Return in cases of a "real" failure, if fallocate is not supported,
+	 * fall through to the FileZero() backed implementation.
+	 */
+	if (returnCode != EINVAL && returnCode != EOPNOTSUPP)
+		return returnCode;

I'm not entirely sure, but man 2 fallocate says that ENOSYS can also be
returned. Some googling indicates that ENOSYS might need the same treatment as
EOPNOTSUPP. However, I'm not clear on why man posix_fallocate doesn't mention
the former.

+		(returnCode != EINVAL && returnCode != EINVAL))

:)

 FileGetRawDesc(File file)
 {
 	Assert(FileIsValid(file));
+
+	if (FileAccess(file) < 0)
+		return -1;
+

The function's comment is provided below.

> * The returned file descriptor will be valid until the file is closed, but
> * there are a lot of things that can make that happen. So the caller should
> * be careful not to do much of anything else before it finishes using the
> * returned file descriptor.

So, the responsibility for making sure the file is valid seems to lie with the
callers, although I'm not sure since there aren't any users of the function in
the tree.

I'm unclear as to why FileSize omits the case lruLessRecently != file.
When examining similar functions, such as FileGetRawFlags and FileGetRawMode,
I'm puzzled to find that neither FileAccess() nor BasicOpenFilePerm() sets the
struct members those functions refer to. This makes me question the usefulness
of these functions, including FileGetRawDesc(). Regardless, since the patchset
doesn't use FileGetRawDesc(), I don't believe the fix is necessary in this
patch set.

+	if ((uint64) blocknum + nblocks >= (uint64) InvalidBlockNumber)

I'm not sure it is appropriate to assume InvalidBlockNumber equals
MaxBlockNumber + 1 in this context.

+		int			segstartblock = curblocknum % ((BlockNumber) RELSEG_SIZE);
+		int			segendblock = (curblocknum % ((BlockNumber) RELSEG_SIZE)) + remblocks;
+		off_t		seekpos = (off_t) BLCKSZ * segstartblock;

segendblock can be defined as "segstartblock + remblocks", which would be
clearer.

+		 * If available and useful, use posix_fallocate() (via FileAllocate())

FileFallocate()?

+		 * However, we don't use FileAllocate() for small extensions, as it
+		 * defeats delayed allocation on some filesystems. Not clear where
+		 * that decision should be made though? For now just use a cutoff of
+		 * 8, anything between 4 and 8 worked OK in some local testing.

The choice here is quite similar to the one FileFallocate() itself makes.
However, I'm not sure FileFallocate() itself should be doing this.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
Hi, On 2023-03-26 17:42:45 -0400, Melanie Plageman wrote: > Below is my review of a slightly older version than you just posted -- > much of it you may have already addressed. Far from all of it is addressed already - thanks for the review! > From 3a6c3f41000e057bae12ab4431e6bb1c5f3ec4b0 Mon Sep 17 00:00:00 2001 > From: Andres Freund <andres@anarazel.de> > Date: Mon, 20 Mar 2023 21:57:40 -0700 > Subject: [PATCH v5 01/15] createdb-using-wal-fixes > > This could use a more detailed commit message -- I don't really get what > it is doing It's a fix for a bug that I encountered while hacking on this, it since has been committed. commit 5df319f3d55d09fadb4f7e4b58c5b476a3aeceb4 Author: Andres Freund <andres@anarazel.de> Date: 2023-03-20 21:57:40 -0700 Fix memory leak and inefficiency in CREATE DATABASE ... STRATEGY WAL_LOG RelationCopyStorageUsingBuffer() did not free the strategies used to access the source / target relation. The memory was released at the end of the transaction, but when using a template database with a lot of relations, the temporary leak can become prohibitively big. RelationCopyStorageUsingBuffer() acquired the buffer for the target relation with RBM_NORMAL, therefore requiring a read of a block guaranteed to be zero. Use RBM_ZERO_AND_LOCK instead. 
Reviewed-by: Robert Haas <robertmhaas@gmail.com> Discussion: https://postgr.es/m/20230321070113.o2vqqxogjykwgfrr@awork3.anarazel.de Backpatch: 15-, where STRATEGY WAL_LOG was introduced > From 6faba69c241fd5513022bb042c33af09d91e84a6 Mon Sep 17 00:00:00 2001 > From: Andres Freund <andres@anarazel.de> > Date: Wed, 1 Jul 2020 19:06:45 -0700 > Subject: [PATCH v5 02/15] Add some error checking around pinning > > --- > src/backend/storage/buffer/bufmgr.c | 40 ++++++++++++++++++++--------- > src/include/storage/bufmgr.h | 1 + > 2 files changed, 29 insertions(+), 12 deletions(-) > > diff --git a/src/backend/storage/buffer/bufmgr.c > b/src/backend/storage/buffer/bufmgr.c > index 95212a3941..fa20fab5a2 100644 > --- a/src/backend/storage/buffer/bufmgr.c > +++ b/src/backend/storage/buffer/bufmgr.c > @@ -4283,6 +4287,25 @@ ConditionalLockBuffer(Buffer buffer) > LW_EXCLUSIVE); > } > > +void > +BufferCheckOneLocalPin(Buffer buffer) > +{ > + if (BufferIsLocal(buffer)) > + { > + /* There should be exactly one pin */ > + if (LocalRefCount[-buffer - 1] != 1) > + elog(ERROR, "incorrect local pin count: %d", > + LocalRefCount[-buffer - 1]); > + } > + else > + { > + /* There should be exactly one local pin */ > + if (GetPrivateRefCount(buffer) != 1) > > I'd rather this be an else if (was already like this, but, no reason not > to change it now) I don't like that much - it'd break the symmetry between local / non-local. > +/* > + * Zero a region of the file. > + * > + * Returns 0 on success, -1 otherwise. In the latter case errno is set to the > + * appropriate error. 
> + */ > +int > +FileZero(File file, off_t offset, off_t amount, uint32 wait_event_info) > +{ > + int returnCode; > + ssize_t written; > + > + Assert(FileIsValid(file)); > + returnCode = FileAccess(file); > + if (returnCode < 0) > + return returnCode; > + > + pgstat_report_wait_start(wait_event_info); > + written = pg_pwrite_zeros(VfdCache[file].fd, amount, offset); > + pgstat_report_wait_end(); > + > + if (written < 0) > + return -1; > + else if (written != amount) > > this doesn't need to be an else if You mean it could be a "bare" if instead? I don't really think that's clearer. > + { > + /* if errno is unset, assume problem is no disk space */ > + if (errno == 0) > + errno = ENOSPC; > + return -1; > + } > > +int > +FileFallocate(File file, off_t offset, off_t amount, uint32 wait_event_info) > +{ > +#ifdef HAVE_POSIX_FALLOCATE > + int returnCode; > + > + Assert(FileIsValid(file)); > + returnCode = FileAccess(file); > + if (returnCode < 0) > + return returnCode; > + > + pgstat_report_wait_start(wait_event_info); > + returnCode = posix_fallocate(VfdCache[file].fd, offset, amount); > + pgstat_report_wait_end(); > + > + if (returnCode == 0) > + return 0; > + > + /* for compatibility with %m printing etc */ > + errno = returnCode; > + > + /* > + * Return in cases of a "real" failure, if fallocate is not supported, > + * fall through to the FileZero() backed implementation. > + */ > + if (returnCode != EINVAL && returnCode != EOPNOTSUPP) > + return returnCode; > > I'm pretty sure you can just delete the below if statement > > + if (returnCode == 0 || > + (returnCode != EINVAL && returnCode != EINVAL)) > + return returnCode; Hm. I don't see how - wouldn't that lead us to call FileZero(), even if FileFallocate() succeeded or failed (rather than not being supported)? > > +/* > + * mdzeroextend() -- Add ew zeroed out blocks to the specified relation. 
> > not sure what ew is A hurried new :) > + * > + * Similar to mdrextend(), except the relation can be extended by > > mdrextend->mdextend > + * multiple blocks at once, and that the added blocks will be > filled with > > I would lose the comma and just say "and the added blocks will be filled..." Done. > +void > +mdzeroextend(SMgrRelation reln, ForkNumber forknum, > + BlockNumber blocknum, int nblocks, bool skipFsync) > > So, I think there are a few too many local variables in here, and it > actually makes it more confusing. > Assuming you would like to keep the input parameters blocknum and > nblocks unmodified for debugging/other reasons, here is a suggested > refactor of this function I'm mostly adopting this. > Also, I think you can combine the two error cases (I don't know if the > user cares what you were trying to extend the file with). Hm. I do find it a somewhat useful distinction for figuring out problems - we haven't used posix_fallocate for files so far, it seems plausible we'd hit some portability issues. We could make it an errdetail(), I guess? > void > mdzeroextend(SMgrRelation reln, ForkNumber forknum, > BlockNumber blocknum, int nblocks, bool skipFsync) > { > MdfdVec *v; > BlockNumber curblocknum = blocknum; > int remblocks = nblocks; > Assert(nblocks > 0); > > /* This assert is too expensive to have on normally ... */ > #ifdef CHECK_WRITE_VS_EXTEND > Assert(blocknum >= mdnblocks(reln, forknum)); > #endif > > /* > * If a relation manages to grow to 2^32-1 blocks, refuse to extend it any > * more --- we mustn't create a block whose number actually is > * InvalidBlockNumber or larger. 
> */ > if ((uint64) blocknum + nblocks >= (uint64) InvalidBlockNumber) > ereport(ERROR, > (errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED), > errmsg("cannot extend file \"%s\" beyond %u blocks", > relpath(reln->smgr_rlocator, forknum), > InvalidBlockNumber))); > > while (remblocks > 0) > { > int segstartblock = curblocknum % ((BlockNumber) > RELSEG_SIZE); Hm - this shouldn't be an int - that's my fault, not yours... > From ad7cd10a6c340d7f7d0adf26d5e39224dfd8439d Mon Sep 17 00:00:00 2001 > From: Andres Freund <andres@anarazel.de> > Date: Wed, 26 Oct 2022 12:05:07 -0700 > Subject: [PATCH v5 05/15] bufmgr: Add Pin/UnpinLocalBuffer() > > diff --git a/src/backend/storage/buffer/bufmgr.c > b/src/backend/storage/buffer/bufmgr.c > index fa20fab5a2..6f50dbd212 100644 > --- a/src/backend/storage/buffer/bufmgr.c > +++ b/src/backend/storage/buffer/bufmgr.c > @@ -4288,18 +4268,16 @@ ConditionalLockBuffer(Buffer buffer) > } > > void > -BufferCheckOneLocalPin(Buffer buffer) > +BufferCheckWePinOnce(Buffer buffer) > > This name is weird. Who is we? The current backend. I.e. the function checks that the current backend pins the buffer exactly once, rather than that *any* backend pins it once. I now see that BufferIsPinned() is named, IMO, misleadingly generally, even though it also just applies to pins by the current backend. > diff --git a/src/backend/storage/buffer/localbuf.c > b/src/backend/storage/buffer/localbuf.c > index 5325ddb663..798c5b93a8 100644 > --- a/src/backend/storage/buffer/localbuf.c > +++ b/src/backend/storage/buffer/localbuf.c > +bool > +PinLocalBuffer(BufferDesc *buf_hdr, bool adjust_usagecount) > +{ > + uint32 buf_state; > + Buffer buffer = BufferDescriptorGetBuffer(buf_hdr); > + int bufid = -(buffer + 1); > > You do > int buffid = -buffer - 1; > in UnpinLocalBuffer() > They should be consistent. 
> > int bufid = -(buffer + 1); > > I think this version is better: > > int buffid = -buffer - 1; > > Since if buffer is INT_MAX, then the -(buffer + 1) version invokes > undefined behavior while the -buffer - 1 version doesn't. You are right! Not sure what I was doing there... Ah - turns out there's pre-existing code in localbuf.c that does it that way :( See at least MarkLocalBufferDirty(). We really need to wrap this in something, rather than open coding it all over bufmgr.c/localbuf.c. I really dislike this indexing :(. > From a0228218e2ac299aac754eeb5b2be7ddfc56918d Mon Sep 17 00:00:00 2001 > From: Andres Freund <andres@anarazel.de> > Date: Fri, 17 Feb 2023 18:26:34 -0800 > Subject: [PATCH v5 07/15] bufmgr: Acquire and clean victim buffer separately > > Previously we held buffer locks for two buffer mapping partitions at the same > time to change the identity of buffers. Particularly for extending relations > needing to hold the extension lock while acquiring a victim buffer is > painful. By separating out the victim buffer acquisition, future commits will > be able to change relation extensions to scale better. > > diff --git a/src/backend/storage/buffer/bufmgr.c > b/src/backend/storage/buffer/bufmgr.c > index 3d0683593f..ea423ae484 100644 > --- a/src/backend/storage/buffer/bufmgr.c > +++ b/src/backend/storage/buffer/bufmgr.c > @@ -1200,293 +1200,111 @@ BufferAlloc(SMgrRelation smgr, char > relpersistence, ForkNumber forkNum, > > /* > * Buffer contents are currently invalid. Try to obtain the right to > * start I/O. If StartBufferIO returns false, then someone else managed > * to read it before we did, so there's nothing left for BufferAlloc() to > * do. > */ > - if (StartBufferIO(buf, true)) > + if (StartBufferIO(victim_buf_hdr, true)) > *foundPtr = false; > else > *foundPtr = true; > > I know it was already like this, but since you edited the line already, > can we just make this this now? 
> > *foundPtr = !StartBufferIO(victim_buf_hdr, true); Hm, I do think it's easier to review if largely unchanged code is just moved around, rather than also rewritten. So I'm hesitant to do that here. > @@ -1595,6 +1413,237 @@ retry: > StrategyFreeBuffer(buf); > } > > +/* > + * Helper routine for GetVictimBuffer() > + * > + * Needs to be called on a buffer with a valid tag, pinned, but without the > + * buffer header spinlock held. > + * > + * Returns true if the buffer can be reused, in which case the buffer is only > + * pinned by this backend and marked as invalid, false otherwise. > + */ > +static bool > +InvalidateVictimBuffer(BufferDesc *buf_hdr) > +{ > + /* > + * Clear out the buffer's tag and flags and usagecount. This is not > + * strictly required, as BM_TAG_VALID/BM_VALID needs to be checked before > + * doing anything with the buffer. But currently it's beneficial as the > + * pre-check for several linear scans of shared buffers just checks the > + * tag. > > I don't really understand the above comment -- mainly the last sentence. To start with, it's s/checks/check/. "linear scans" is a reference to functions like DropRelationBuffers(), which iterate over all buffers, and just check the tag for a match. If we leave the tag around, it'll still work, as InvalidateBuffer() etc will figure out that the buffer is invalid. But of course that's slower than just skipping the buffer "early on". > +static Buffer > +GetVictimBuffer(BufferAccessStrategy strategy, IOContext io_context) > +{ > + BufferDesc *buf_hdr; > + Buffer buf; > + uint32 buf_state; > + bool from_ring; > + > + /* > + * Ensure, while the spinlock's not yet held, that there's a free refcount > + * entry. > + */ > + ReservePrivateRefCountEntry(); > + ResourceOwnerEnlargeBuffers(CurrentResourceOwner); > + > + /* we return here if a prospective victim buffer gets used concurrently */ > +again: > > Why use goto instead of a loop here (again is the goto label)? I find it way more readable this way. 
I'd use a loop if it were the common case to loop, but it's the rare case, and for that I find the goto more readable. > @@ -4709,8 +4704,6 @@ TerminateBufferIO(BufferDesc *buf, bool > clear_dirty, uint32 set_flag_bits) > { > uint32 buf_state; > > I noticed that the comment above TerminateBufferIO() says > * TerminateBufferIO: release a buffer we were doing I/O on > * (Assumptions) > * My process is executing IO for the buffer > > Can we still say this is an assumption? What about when it is being > cleaned up after being called from AbortBufferIO() That hasn't really changed - it was already called by AbortBufferIO(). I think it's still correct, too. We must have marked the IO as being in progress to get there. > diff --git a/src/backend/utils/resowner/resowner.c > b/src/backend/utils/resowner/resowner.c > index 19b6241e45..fccc59b39d 100644 > --- a/src/backend/utils/resowner/resowner.c > +++ b/src/backend/utils/resowner/resowner.c > @@ -121,6 +121,7 @@ typedef struct ResourceOwnerData > > /* We have built-in support for remembering: */ > ResourceArray bufferarr; /* owned buffers */ > + ResourceArray bufferioarr; /* in-progress buffer IO */ > ResourceArray catrefarr; /* catcache references */ > ResourceArray catlistrefarr; /* catcache-list pins */ > ResourceArray relrefarr; /* relcache references */ > @@ -441,6 +442,7 @@ ResourceOwnerCreate(ResourceOwner parent, const char *name) > > Maybe worth mentioning in-progress buffer IO in resowner README? I know > it doesn't claim to be exhaustive, so, up to you. Hm. Given the few types of resources mentioned in the README, I don't think it's worth doing so. > Also, I realize that existing code in this file has the extraneous > parentheses, but maybe it isn't worth staying consistent with that? > as in: &(owner->bufferioarr) I personally don't find it worth being consistent with that, but if you / others think it is, I'd be ok with adapting to that. 
> From f26d1fa7e528d04436402aa8f94dc2442999dde3 Mon Sep 17 00:00:00 2001 > From: Andres Freund <andres@anarazel.de> > Date: Wed, 1 Mar 2023 13:24:19 -0800 > Subject: [PATCH v5 09/15] bufmgr: Move relation extension handling into > ExtendBufferedRel{By,To,} > > diff --git a/src/backend/storage/buffer/bufmgr.c > b/src/backend/storage/buffer/bufmgr.c > index 3c95b87bca..4e07a5bc48 100644 > --- a/src/backend/storage/buffer/bufmgr.c > +++ b/src/backend/storage/buffer/bufmgr.c > > +/* > + * Extend relation by multiple blocks. > + * > + * Tries to extend the relation by extend_by blocks. Depending on the > + * availability of resources the relation may end up being extended by a > + * smaller number of pages (unless an error is thrown, always by at least one > + * page). *extended_by is updated to the number of pages the relation has been > + * extended to. > + * > + * buffers needs to be an array that is at least extend_by long. Upon > + * completion, the first extend_by array elements will point to a pinned > + * buffer. > + * > + * If EB_LOCK_FIRST is part of flags, the first returned buffer is > + * locked. This is useful for callers that want a buffer that is guaranteed to > + * be empty. > > This should document what the returned BlockNumber is. Ok. > Also, instead of having extend_by and extended_by, how about just having > one which is set by the caller to the desired number to extend by and > then overwritten in this function to the value it successfully extended > by. I had it that way at first - but I think it turned out to be more confusing. > It would be nice if the function returned the number it extended by > instead of the BlockNumber. It's not actually free to get the block number from a buffer (it causes more sharing of the BufferDesc cacheline, which then makes modifications of the cacheline more expensive). We should work on removing all those BufferGetBlockNumber(). 
So I don't want to introduce a new function that requires using BufferGetBlockNumber(). So I don't think this would be an improvement. > + Assert((eb.rel != NULL) ^ (eb.smgr != NULL)); > > Can we turn these into != > > Assert((eb.rel != NULL) != (eb.smgr != NULL)); > > since it is easier to understand. Done. > + * Extend the relation so it is at least extend_to blocks large, read buffer > > Use of "read buffer" here is confusing. We only read the block if, after > we try extending the relation, someone else already did so and we have > to read the block they extended in, right? That's one case, yes. I think there's also some unfortunate other case that I'd like to get rid of. See my archeology at https://postgr.es/m/20230223010147.32oir7sb66slqnjk%40awork3.anarazel.de > + uint32 num_pages = lengthof(buffers); > + BlockNumber first_block; > + > + if ((uint64) current_size + num_pages > extend_to) > + num_pages = extend_to - current_size; > + > + first_block = ExtendBufferedRelCommon(eb, fork, strategy, flags, > + num_pages, extend_to, > + buffers, &extended_by); > + > + current_size = first_block + extended_by; > + Assert(current_size <= extend_to); > + Assert(num_pages != 0 || current_size >= extend_to); > + > + for (int i = 0; i < extended_by; i++) > + { > + if (first_block + i != extend_to - 1) > > Is there a way we could avoid pinning these other buffers to begin with > (e.g. passing a parameter to ExtendBufferedRelCommon()) We can't avoid pinning them. We could make ExtendBufferedRelCommon() release them though - but I'm not sure that'd be an improvement. I actually had a flag for that temporarily, but > + if (buffer == InvalidBuffer) > + { > + bool hit; > + > + Assert(extended_by == 0); > + buffer = ReadBuffer_common(eb.smgr, eb.relpersistence, > + fork, extend_to - 1, mode, strategy, > + &hit); > + } > + > + return buffer; > +} > > Do we use compound literals? 
Here, this could be: > > buffer = ReadBuffer_common(eb.smgr, eb.relpersistence, > fork, extend_to - 1, mode, strategy, > &(bool) {0}); > > To eliminate the extraneous hit variable. We do use compound literals in a few places. However, I don't think it's a good idea to pass a pointer to a temporary. At least I need to look up the lifetime rules for those every time. And this isn't a huge win, so I wouldn't go for it here. > /* > * ReadBuffer_common -- common logic for all ReadBuffer variants > @@ -801,35 +991,36 @@ ReadBuffer_common(SMgrRelation smgr, char > relpersistence, ForkNumber forkNum, > bool found; > IOContext io_context; > IOObject io_object; > - bool isExtend; > bool isLocalBuf = SmgrIsTemp(smgr); > > *hit = false; > > + /* > + * Backward compatibility path, most code should use > + * ExtendRelationBuffered() instead, as acquiring the extension lock > + * inside ExtendRelationBuffered() scales a lot better. > > Think these are old function names in the comment Indeed. > +static BlockNumber > +ExtendBufferedRelShared(ExtendBufferedWhat eb, > + ForkNumber fork, > + BufferAccessStrategy strategy, > + uint32 flags, > + uint32 extend_by, > + BlockNumber extend_upto, > + Buffer *buffers, > + uint32 *extended_by) > +{ > + BlockNumber first_block; > + IOContext io_context = IOContextForStrategy(strategy); > + > + LimitAdditionalPins(&extend_by); > + > + /* > + * Acquire victim buffers for extension without holding extension lock. > + * Writing out victim buffers is the most expensive part of extending the > + * relation, particularly when doing so requires WAL flushes. Zeroing out > + * the buffers is also quite expensive, so do that before holding the > + * extension lock as well. > + * > + * These pages are pinned by us and not valid. While we hold the pin they > + * can't be acquired as victim buffers by another backend. 
> + */ > + for (uint32 i = 0; i < extend_by; i++) > + { > + Block buf_block; > + > + buffers[i] = GetVictimBuffer(strategy, io_context); > + buf_block = BufHdrGetBlock(GetBufferDescriptor(buffers[i] - 1)); > + > + /* new buffers are zero-filled */ > + MemSet((char *) buf_block, 0, BLCKSZ); > + } > + > + /* > + * Lock relation against concurrent extensions, unless requested not to. > + * > + * We use the same extension lock for all forks. That's unnecessarily > + * restrictive, but currently extensions for forks don't happen often > + * enough to make it worth locking more granularly. > + * > + * Note that another backend might have extended the relation by the time > + * we get the lock. > + */ > + if (!(flags & EB_SKIP_EXTENSION_LOCK)) > + { > + LockRelationForExtension(eb.rel, ExclusiveLock); > + eb.smgr = RelationGetSmgr(eb.rel); > + } > + > + /* > + * If requested, invalidate size cache, so that smgrnblocks asks the > + * kernel. > + */ > + if (flags & EB_CLEAR_SIZE_CACHE) > + eb.smgr->smgr_cached_nblocks[fork] = InvalidBlockNumber; > > I don't see this in master, is it new? Not really - it's just elsewhere. See vm_extend() and fsm_extend(). I could move this part back into "Convert a few places to ExtendBufferedRelTo", but I doin't think that'd be better. > + * Flags influencing the behaviour of ExtendBufferedRel* > + */ > +typedef enum ExtendBufferedFlags > +{ > + /* > + * Don't acquire extension lock. This is safe only if the relation isn't > + * shared, an access exclusive lock is held or if this is the startup > + * process. > + */ > + EB_SKIP_EXTENSION_LOCK = (1 << 0), > + > + /* Is this extension part of recovery? */ > + EB_PERFORMING_RECOVERY = (1 << 1), > + > + /* > + * Should the fork be created if it does not currently exist? This likely > + * only ever makes sense for relation forks. > + */ > + EB_CREATE_FORK_IF_NEEDED = (1 << 2), > + > + /* Should the first (possibly only) return buffer be returned locked? 
*/ > + EB_LOCK_FIRST = (1 << 3), > + > + /* Should the smgr size cache be cleared? */ > + EB_CLEAR_SIZE_CACHE = (1 << 4), > + > + /* internal flags follow */ > > I don't understand what this comment means ("internal flags follow") Hm - just that the flags defined subsequently are for internal use, not for callers to specify. > + */ > +static int > +heap_multi_insert_pages(HeapTuple *heaptuples, int done, int ntuples, > Size saveFreeSpace) > +{ > + size_t page_avail; > + int npages = 0; > + > + page_avail = BLCKSZ - SizeOfPageHeaderData - saveFreeSpace; > + npages++; > > can this not just be this: > > size_t page_avail = BLCKSZ - SizeOfPageHeaderData - saveFreeSpace; > int npages = 1; Yes. > > From 5d2be27caf8f4ee8f26841b2aa1674c90bd51754 Mon Sep 17 00:00:00 2001 > From: Andres Freund <andres@anarazel.de> > Date: Wed, 26 Oct 2022 14:14:11 -0700 > Subject: [PATCH v5 12/15] hio: Use ExtendBufferedRelBy() > --- > src/backend/access/heap/hio.c | 285 +++++++++++++++++----------------- > 1 file changed, 146 insertions(+), 139 deletions(-) > > diff --git a/src/backend/access/heap/hio.c b/src/backend/access/heap/hio.c > index 65886839e7..48cfcff975 100644 > --- a/src/backend/access/heap/hio.c > +++ b/src/backend/access/heap/hio.c > @@ -354,6 +270,9 @@ RelationGetBufferForTuple(Relation relation, Size len, > > so in RelationGetBufferForTuple() up above where your changes start, > there is this code > > > /* > * We first try to put the tuple on the same page we last inserted a tuple > * on, as cached in the BulkInsertState or relcache entry. If that > * doesn't work, we ask the Free Space Map to locate a suitable page. > * Since the FSM's info might be out of date, we have to be prepared to > * loop around and retry multiple times. (To insure this isn't an infinite > * loop, we must update the FSM with the correct amount of free space on > * each page that proves not to be suitable.) 
If the FSM has no record of > * a page with enough free space, we give up and extend the relation. > * > * When use_fsm is false, we either put the tuple onto the existing target > * page or extend the relation. > */ > if (bistate && bistate->current_buf != InvalidBuffer) > { > targetBlock = BufferGetBlockNumber(bistate->current_buf); > } > else > targetBlock = RelationGetTargetBlock(relation); > > if (targetBlock == InvalidBlockNumber && use_fsm) > { > /* > * We have no cached target page, so ask the FSM for an initial > * target. > */ > targetBlock = GetPageWithFreeSpace(relation, targetFreeSpace); > } > > And, I was thinking how, ReadBufferBI() only has one caller now > (RelationGetBufferForTuple()) and, this caller basically already has > checked for the case in the inside of ReadBufferBI() (the code I pasted > above) > > /* If we have the desired block already pinned, re-pin and return it */ > if (bistate->current_buf != InvalidBuffer) > { > if (BufferGetBlockNumber(bistate->current_buf) == targetBlock) > { > /* > * Currently the LOCK variants are only used for extending > * relation, which should never reach this branch. > */ > Assert(mode != RBM_ZERO_AND_LOCK && > mode != RBM_ZERO_AND_CLEANUP_LOCK); > > IncrBufferRefCount(bistate->current_buf); > return bistate->current_buf; > } > /* ... else drop the old buffer */ > > So, I was thinking maybe there is some way to inline the logic for > ReadBufferBI(), because I think it would feel more streamlined to me. I don't really see how - I'd welcome suggestions? > @@ -558,18 +477,46 @@ loop: > ReleaseBuffer(buffer); > } > > Oh, and I forget which commit introduced BulkInsertState->next_free and > last_free, but I remember thinking that it didn't seem to fit with the > other parts of that commit. I'll move it into the one using ExtendBufferedRelBy(). 
> - /* Without FSM, always fall out of the loop and extend */ > - if (!use_fsm) > - break; > + if (bistate > + && bistate->next_free != InvalidBlockNumber > + && bistate->next_free <= bistate->last_free) > + { > + /* > + * We bulk extended the relation before, and there are still some > + * unused pages from that extension, so we don't need to look in > + * the FSM for a new page. But do record the free space from the > + * last page, somebody might insert narrower tuples later. > + */ > > Why couldn't we have found out that we bulk-extended before and get the > block from there up above the while loop? I'm not quite sure I follow - above the loop there might still have been space on the prior page? We also need the ability to loop if the space has been used since. I guess there's an argument for also checking above the loop, but I don't think that'd currently ever be reachable. > + { > + bistate->next_free = InvalidBlockNumber; > + bistate->last_free = InvalidBlockNumber; > + } > + else > + bistate->next_free++; > + } > + else if (!use_fsm) > + { > + /* Without FSM, always fall out of the loop and extend */ > + break; > + } > > It would be nice to have a comment explaining why this is in its own > else if instead of breaking earlier (i.e. !use_fsm is still a valid case > in the if branch above it) I'm not quite following. Breaking where earlier? Note that that branch is old code, it's just that a new way of getting a page that potentially has free space is preceding it. > we can get rid of needLock and waitcount variables like this > > +#define MAX_BUFFERS 64 > + Buffer victim_buffers[MAX_BUFFERS]; > + BlockNumber firstBlock = InvalidBlockNumber; > + BlockNumber firstBlockFSM = InvalidBlockNumber; > + BlockNumber curBlock; > + uint32 extend_by_pages; > + uint32 no_fsm_pages; > + uint32 waitcount; > + > + extend_by_pages = num_pages; > + > + /* > + * Multiply the number of pages to extend by the number of waiters. 
Do > + * this even if we're not using the FSM, as it does relieve > + * contention. Pages will be found via bistate->next_free. > + */ > + if (needLock) > + waitcount = RelationExtensionLockWaiterCount(relation); > + else > + waitcount = 0; > + extend_by_pages += extend_by_pages * waitcount; > > if (!RELATION_IS_LOCAL(relation)) > extend_by_pages += extend_by_pages * > RelationExtensionLockWaiterCount(relation); I guess I find it useful to be able to quickly add logging messages for stuff like this. I don't think local variables are as bad as you make them out to be :) Thanks for the review! Andres
Hi, On 2023-03-27 15:32:47 +0900, Kyotaro Horiguchi wrote: > At Sun, 26 Mar 2023 12:26:59 -0700, Andres Freund <andres@anarazel.de> wrote in > > Hi, > > > > Attached is v5. Lots of comment polishing, a bit of renaming. I extracted the > > relation extension related code in hio.c back into its own function. > > > > While reviewing the hio.c code, I did realize that too much stuff is done > > while holding the buffer lock. See also the pre-existing issue > > https://postgr.es/m/20230325025740.wzvchp2kromw4zqz%40awork3.anarazel.de > > 0001, 0002 look fine to me. > > 0003 adds the new function FileFallocate, but we already have > AllocateFile. Although fd.c contains functions with varying word > orders, it could be confusing that closely named functions have > different naming conventions. The syscall is named fallocate, I don't think we'd gain anything by inventing a different name for it? Given that there's a number of File$syscall operations, I think it's clear enough that it just fits into that. Unless you have a better proposal? > + /* > + * Return in cases of a "real" failure, if fallocate is not supported, > + * fall through to the FileZero() backed implementation. > + */ > + if (returnCode != EINVAL && returnCode != EOPNOTSUPP) > + return returnCode; > > I'm not entirely sure, but man 2 fallocate says that ENOSYS can also > be returned. Some googling indicates that ENOSYS might need the same > treatment as EOPNOTSUPP. However, I'm not clear on why man > posix_fallocate doesn't mention the former. posix_fallocate() and its errors are specified by POSIX, I guess. I think glibc etc will map ENOSYS to EOPNOTSUPP. I really dislike this bit from the posix_fallocate manpage: EINVAL offset was less than 0, or len was less than or equal to 0, or the underlying filesystem does not support the operation. Why oh why would you add the "or .." portion into EINVAL, when there's also EOPNOTSUPP? 
> > + (returnCode != EINVAL && returnCode != EINVAL)) > :) > > > FileGetRawDesc(File file) > { > Assert(FileIsValid(file)); > + > + if (FileAccess(file) < 0) > + return -1; > + > > The function's comment is provided below. > > > * The returned file descriptor will be valid until the file is closed, but > > * there are a lot of things that can make that happen. So the caller should > > * be careful not to do much of anything else before it finishes using the > > * returned file descriptor. > > So, the responsibility to make sure the file is valid seems to lie > with the callers, although I'm not sure since there aren't any > users of the function in the tree. Except, as I think you realized as well, external callers *can't* call FileAccess(), it's static. > I'm unclear as to why FileSize omits the case lruLessRecently != file. Not quite following - why would FileSize() deal with lruLessRecently itself? Or do you mean why FileSize() uses FileIsNotOpen() itself, rather than relying on FileAccess() doing that internally? > When examining similar functions, such as FileGetRawFlags and > FileGetRawMode, I'm puzzled to find that neither FileAccess() nor > BasicOpenFilePerm() sets the struct members referred to by those > functions. Those aren't involved with the LRU mechanism, IIRC. Note that BasicOpenFilePerm() returns an actual fd, not a File. So you can't call FileGetRawMode() on it. As BasicOpenFilePerm() says: * This is exported for use by places that really want a plain kernel FD, * but need to be proof against running out of FDs. ... I don't think FileAccess() needs to set those struct members, that's already been done in PathNameOpenFilePerm(). > This makes me question the usefulness of these functions, including > FileGetRawDesc(). It's quite weird that we have FileGetRawDesc(), but don't allow it to be used in a safe way... > Regardless, since the > patchset doesn't use FileGetRawDesc(), I don't believe the fix is > necessary in this patch set. Yea. 
It was used in an earlier version, but not anymore. > > + if ((uint64) blocknum + nblocks >= (uint64) InvalidBlockNumber) > > I'm not sure it is appropriate to assume InvalidBlockNumber equals > MaxBlockNumber + 1 in this context. Hm. This is just the multi-block equivalent of what mdextend() already does. It's not pretty, indeed. I'm not sure there's really a better thing to do for mdzeroextend(), given the mdextend() precedent? mdzeroextend() (just as mdextend()) will be called with blockNum == InvalidBlockNumber if you try to extend past the size limit. > > + * However, we don't use FileAllocate() for small extensions, as it > + * defeats delayed allocation on some filesystems. Not clear where > + * that decision should be made though? For now just use a cutoff of > + * 8, anything between 4 and 8 worked OK in some local testing. > > The choice here is quite similar to the one FileFallocate() makes. However, I'm > not sure FileFallocate() itself should be doing this. I'm not following - there's no such choice in FileFallocate()? Do you mean that FileFallocate() also falls back to FileZero()? I don't think those are comparable. We don't want the code outside of fd.c to have to implement a fallback for platforms that don't have fallocate (or something similar), that's why FileFallocate() falls back to FileZero(). Here we care about not calling FileFallocate() in too small increments, when the relation might extend further. If we somehow knew in mdzeroextend() that the file won't be extended further, it'd be a good idea to call FileFallocate(), even if just for a single block - it'd prevent the kernel from wasting memory for delayed allocation. But unfortunately we don't know if it's the final size, hence the heuristic. Does that make sense? Thanks! Andres
On Tue, Mar 28, 2023 at 11:47 PM Andres Freund <andres@anarazel.de> wrote: > > On 2023-03-26 17:42:45 -0400, Melanie Plageman wrote: > > + { > > + /* if errno is unset, assume problem is no disk space */ > > + if (errno == 0) > > + errno = ENOSPC; > > + return -1; > > + } > > > > +int > > +FileFallocate(File file, off_t offset, off_t amount, uint32 wait_event_info) > > +{ > > +#ifdef HAVE_POSIX_FALLOCATE > > + int returnCode; > > + > > + Assert(FileIsValid(file)); > > + returnCode = FileAccess(file); > > + if (returnCode < 0) > > + return returnCode; > > + > > + pgstat_report_wait_start(wait_event_info); > > + returnCode = posix_fallocate(VfdCache[file].fd, offset, amount); > > + pgstat_report_wait_end(); > > + > > + if (returnCode == 0) > > + return 0; > > + > > + /* for compatibility with %m printing etc */ > > + errno = returnCode; > > + > > + /* > > + * Return in cases of a "real" failure, if fallocate is not supported, > > + * fall through to the FileZero() backed implementation. > > + */ > > + if (returnCode != EINVAL && returnCode != EOPNOTSUPP) > > + return returnCode; > > > > I'm pretty sure you can just delete the below if statement > > > > + if (returnCode == 0 || > > + (returnCode != EINVAL && returnCode != EINVAL)) > > + return returnCode; > > Hm. I don't see how - wouldn't that lead us to call FileZero(), even if > FileFallocate() succeeded or failed (rather than not being supported)? Uh...I'm confused...maybe my eyes aren't working. If returnCode was 0, you already would have returned and if returnCode wasn't EINVAL, you also already would have returned. Not to mention (returnCode != EINVAL && returnCode != EINVAL) contains two identical operands. > > +void > > +mdzeroextend(SMgrRelation reln, ForkNumber forknum, > > + BlockNumber blocknum, int nblocks, bool skipFsync) > > > > So, I think there are a few too many local variables in here, and it > > actually makes it more confusing. 
> > Assuming you would like to keep the input parameters blocknum and > > nblocks unmodified for debugging/other reasons, here is a suggested > > refactor of this function > > I'm mostly adopting this. > > > > Also, I think you can combine the two error cases (I don't know if the > > user cares what you were trying to extend the file with). > > Hm. I do find it a somewhat useful distinction for figuring out problems - we > haven't used posix_fallocate for files so far, it seems plausible we'd hit > some portability issues. We could make it an errdetail(), I guess? I think that would be clearer. > > From ad7cd10a6c340d7f7d0adf26d5e39224dfd8439d Mon Sep 17 00:00:00 2001 > > From: Andres Freund <andres@anarazel.de> > > Date: Wed, 26 Oct 2022 12:05:07 -0700 > > Subject: [PATCH v5 05/15] bufmgr: Add Pin/UnpinLocalBuffer() > > > > diff --git a/src/backend/storage/buffer/bufmgr.c > > b/src/backend/storage/buffer/bufmgr.c > > index fa20fab5a2..6f50dbd212 100644 > > --- a/src/backend/storage/buffer/bufmgr.c > > +++ b/src/backend/storage/buffer/bufmgr.c > > @@ -4288,18 +4268,16 @@ ConditionalLockBuffer(Buffer buffer) > > } > > > > void > > -BufferCheckOneLocalPin(Buffer buffer) > > +BufferCheckWePinOnce(Buffer buffer) > > > > This name is weird. Who is we? > > The current backend. I.e. the function checks that the current backend pins the > buffer exactly once, rather than *any* backend pinning it once. > > I now see that BufferIsPinned() is named, IMO, misleadingly, more generally, > even though it also just applies to pins by the current backend. Maybe there is a way to use "self" instead of a pronoun? But, if you feel quite strongly about a pronoun, I think "we" implies more than one backend, so "I" would be better. > > @@ -1595,6 +1413,237 @@ retry: > > StrategyFreeBuffer(buf); > > } > > > > +/* > > + * Helper routine for GetVictimBuffer() > > + * > > + * Needs to be called on a buffer with a valid tag, pinned, but without the > > + * buffer header spinlock held. 
> > + * > > + * Returns true if the buffer can be reused, in which case the buffer is only > > + * pinned by this backend and marked as invalid, false otherwise. > > + */ > > +static bool > > +InvalidateVictimBuffer(BufferDesc *buf_hdr) > > +{ > > + /* > > + * Clear out the buffer's tag and flags and usagecount. This is not > > + * strictly required, as BM_TAG_VALID/BM_VALID needs to be checked before > > + * doing anything with the buffer. But currently it's beneficial as the > > + * pre-check for several linear scans of shared buffers just checks the > > + * tag. > > > > I don't really understand the above comment -- mainly the last sentence. > > To start with, it's s/checks/check/ > > "linear scans" is a reference to functions like DropRelationBuffers(), which > iterate over all buffers, and just check the tag for a match. If we leave the > tag around, it'll still work, as InvalidateBuffer() etc will figure out that > the buffer is invalid. But of course that's slower than just skipping the > buffer "early on". Ah. I see the updated comment on your branch and find it to be more clear. > > @@ -4709,8 +4704,6 @@ TerminateBufferIO(BufferDesc *buf, bool > > clear_dirty, uint32 set_flag_bits) > > { > > uint32 buf_state; > > > > I noticed that the comment above TerminateBufferIO() says > > * TerminateBufferIO: release a buffer we were doing I/O on > > * (Assumptions) > > * My process is executing IO for the buffer > > > > Can we still say this is an assumption? What about when it is being > > cleaned up after being called from AbortBufferIO() > > That hasn't really changed - it was already called by AbortBufferIO(). > > I think it's still correct, too. We must have marked the IO as being in > progress to get there. Oh, no I meant the "my process is executing IO for the buffer" -- couldn't another backend clear IO_IN_PROGRESS (i.e. not the original one who set it on all the victim buffers)? 
> > Also, I realize that existing code in this file has the extraneous > > parentheses, but maybe it isn't worth staying consistent with that? > > as in: &(owner->bufferioarr) > > I personally don't find it worth being consistent with that, but if you / > others think it is, I'd be ok with adapting to that. Right, so I'm saying you should remove the extraneous parentheses in the code you added. > > From f26d1fa7e528d04436402aa8f94dc2442999dde3 Mon Sep 17 00:00:00 2001 > > From: Andres Freund <andres@anarazel.de> > > Date: Wed, 1 Mar 2023 13:24:19 -0800 > > Subject: [PATCH v5 09/15] bufmgr: Move relation extension handling into > > ExtendBufferedRel{By,To,} > > + * Flags influencing the behaviour of ExtendBufferedRel* > > + */ > > +typedef enum ExtendBufferedFlags > > +{ > > + /* > > + * Don't acquire extension lock. This is safe only if the relation isn't > > + * shared, an access exclusive lock is held or if this is the startup > > + * process. > > + */ > > + EB_SKIP_EXTENSION_LOCK = (1 << 0), > > + > > + /* Is this extension part of recovery? */ > > + EB_PERFORMING_RECOVERY = (1 << 1), > > + > > + /* > > + * Should the fork be created if it does not currently exist? This likely > > + * only ever makes sense for relation forks. > > + */ > > + EB_CREATE_FORK_IF_NEEDED = (1 << 2), > > + > > + /* Should the first (possibly only) return buffer be returned locked? */ > > + EB_LOCK_FIRST = (1 << 3), > > + > > + /* Should the smgr size cache be cleared? */ > > + EB_CLEAR_SIZE_CACHE = (1 << 4), > > + > > + /* internal flags follow */ > > > > I don't understand what this comment means ("internal flags follow") > > Hm - just that the flags defined subsequently are for internal use, not for > callers to specify. If EB_LOCK_TARGET is the only one of these, it might be more clear, for now, to just say "an internal flag" or "for internal use" above EB_LOCK_TARGET, since it is the only one. 
> > - /* Without FSM, always fall out of the loop and extend */ > > - if (!use_fsm) > > - break; > > + if (bistate > > + && bistate->next_free != InvalidBlockNumber > > + && bistate->next_free <= bistate->last_free) > > + { > > + /* > > + * We bulk extended the relation before, and there are still some > > + * unused pages from that extension, so we don't need to look in > > + * the FSM for a new page. But do record the free space from the > > + * last page, somebody might insert narrower tuples later. > > + */ > > > > Why couldn't we have found out that we bulk-extended before and get the > > block from there up above the while loop? > > I'm not quite sure I follow - above the loop there might still have been space > on the prior page? We also need the ability to loop if the space has been used > since. > > I guess there's an argument for also checking above the loop, but I don't > think that'd currently ever be reachable. My idea was that directly below this code in RelationGetBufferForTuple(): > > if (bistate && bistate->current_buf != InvalidBuffer) > targetBlock = BufferGetBlockNumber(bistate->current_buf); > else > targetBlock = RelationGetTargetBlock(relation); > > We could check bistate->next_free instead of checking the freespace map first. But, perhaps we couldn't hit this because we would already have set current_buf if we had a next_free? - Melanie
Hi, On 2023-03-29 20:51:04 -0400, Melanie Plageman wrote: > > > + if (returnCode == 0) > > > + return 0; > > > + > > > + /* for compatibility with %m printing etc */ > > > + errno = returnCode; > > > + > > > + /* > > > + * Return in cases of a "real" failure, if fallocate is not supported, > > > + * fall through to the FileZero() backed implementation. > > > + */ > > > + if (returnCode != EINVAL && returnCode != EOPNOTSUPP) > > > + return returnCode; > > > > > > I'm pretty sure you can just delete the below if statement > > > > > > + if (returnCode == 0 || > > > + (returnCode != EINVAL && returnCode != EINVAL)) > > > + return returnCode; > > > > Hm. I don't see how - wouldn't that lead us to call FileZero(), even if > > FileFallocate() succeeded or failed (rather than not being supported)? > > Uh...I'm confused...maybe my eyes aren't working. If returnCode was 0, > you already would have returned and if returnCode wasn't EINVAL, you > also already would have returned. > Not to mention (returnCode != EINVAL && returnCode != EINVAL) contains > two identical operands. I'm afraid it was not your eyes that weren't working... > > > void > > > -BufferCheckOneLocalPin(Buffer buffer) > > > +BufferCheckWePinOnce(Buffer buffer) > > > > > > This name is weird. Who is we? > > > > The current backend. I.e. the function checks that the current backend pins the > > buffer exactly once, rather than *any* backend pinning it once. > > > > I now see that BufferIsPinned() is named, IMO, misleadingly, more generally, > > even though it also just applies to pins by the current backend. > > Maybe there is a way to use "self" instead of a pronoun? But, if you > feel quite strongly about a pronoun, I think "we" implies more than one > backend, so "I" would be better. 
I have no strong feelings around this in any form :) > > > @@ -4709,8 +4704,6 @@ TerminateBufferIO(BufferDesc *buf, bool > > > clear_dirty, uint32 set_flag_bits) > > > { > > > uint32 buf_state; > > > > > > I noticed that the comment above TerminateBufferIO() says > > > * TerminateBufferIO: release a buffer we were doing I/O on > > > * (Assumptions) > > > * My process is executing IO for the buffer > > > > > > Can we still say this is an assumption? What about when it is being > > > cleaned up after being called from AbortBufferIO() > > > > That hasn't really changed - it was already called by AbortBufferIO(). > > > > I think it's still correct, too. We must have marked the IO as being in > > progress to get there. > > Oh, no I meant the "my process is executing IO for the buffer" -- > couldn't another backend clear IO_IN_PROGRESS (i.e. not the original one > who set it on all the victim buffers)? No. Or at least not yet ;) - with AIO we will... Only the IO-issuing backend is currently allowed to reset IO_IN_PROGRESS. > > > + /* Should the smgr size cache be cleared? */ > > > + EB_CLEAR_SIZE_CACHE = (1 << 4), > > > + > > > + /* internal flags follow */ > > > > > > I don't understand what this comment means ("internal flags follow") > > > > Hm - just that the flags defined subsequently are for internal use, not for > > callers to specify. > > If EB_LOCK_TARGET is the only one of these, it might be more clear, for > now, to just say "an internal flag" or "for internal use" above > EB_LOCK_TARGET, since it is the only one. I am quite certain it won't be the only one... 
> > > - /* Without FSM, always fall out of the loop and extend */ > > > - if (!use_fsm) > > > - break; > > > + if (bistate > > > + && bistate->next_free != InvalidBlockNumber > > > + && bistate->next_free <= bistate->last_free) > > > + { > > > + /* > > > + * We bulk extended the relation before, and there are still some > > > + * unused pages from that extension, so we don't need to look in > > > + * the FSM for a new page. But do record the free space from the > > > + * last page, somebody might insert narrower tuples later. > > > + */ > > > > > > Why couldn't we have found out that we bulk-extended before and get the > > > block from there up above the while loop? > > > > I'm not quite sure I follow - above the loop there might still have been space > > on the prior page? We also need the ability to loop if the space has been used > > since. > > > > I guess there's an argument for also checking above the loop, but I don't > > think that'd currently ever be reachable. > > My idea was that directly below this code in RelationGetBufferForTuple(): > > if (bistate && bistate->current_buf != InvalidBuffer) > targetBlock = BufferGetBlockNumber(bistate->current_buf); > else > targetBlock = RelationGetTargetBlock(relation); > > We could check bistate->next_free instead of checking the freespace map first. > > But, perhaps we couldn't hit this because we would already have set > current_buf if we had a next_free? Correct. I think it might be worth doing a larger refactoring of that function at some point not too far away... It's definitely somewhat sad that we spend time locking the buffer, rechecking pins etc, for callers like heap_multi_insert() and heap_update() that already know that the page is full. But that seems independent enough that I'd not tackle it now. Greetings, Andres Freund
Hi, Attached is v6. Changes:

- Try to address Melanie and Horiguchi-san's review. I think there are one or two further things that need to be done
- Avoided inserting newly extended pages into the FSM while holding a buffer lock. If we need to do so, we now drop the buffer lock and recheck if there still is space (very very likely). See also [1]. This patchset uses the infrastructure introduced over in that thread.
- Lots of comment and commit message polishing. More needed, particularly for the latter, but ...
- Added a patch to fix the pre-existing undefined behaviour in localbuf.c that Melanie pointed out. Plan to commit that soon.
- Added a patch to fix some pre-existing DO_DB() format code issues. Plan to commit that soon.

I did some benchmarking on "bufmgr: Acquire and clean victim buffer separately" in isolation. For workloads that do a *lot* of reads, that proves to be a substantial benefit on its own. For the, obviously unrealistically extreme, workload of N backends doing SELECT pg_prewarm('pgbench_accounts', 'buffer'); in a scale 100 database (with a 1281MB pgbench_accounts) and shared_buffers of 128MB, I see > 2x gains at 128, 512 clients. Of course realistic workloads will have much smaller gains, but it's still good to see.

Looking at the patchset, I am mostly happy with the breakdown into individual commits. However "bufmgr: Move relation extension handling into ExtendBufferedRel{By,To,}" is quite large. But I don't quite see how to break it into smaller pieces without making things awkward (e.g. due to static functions being unused, or temporarily duplicating the code doing relation extensions).

Greetings, Andres Freund

[1] https://www.postgresql.org/message-id/20230325025740.wzvchp2kromw4zqz%40awork3.anarazel.de
Attachment
- v6-0001-bufmgr-Fix-undefined-behaviour-with-unrealistical.patch
- v6-0002-Fix-format-code-in-fd.c-debugging-infrastructure.patch
- v6-0003-hio-Release-extension-lock-before-initializing-pa.patch
- v6-0004-hio-Don-t-pin-the-VM-while-holding-buffer-lock-wh.patch
- v6-0005-bufmgr-Add-some-error-checking-around-pinning.patch
- v6-0006-Add-smgrzeroextend-FileZero-FileFallocate.patch
- v6-0007-bufmgr-Add-Pin-UnpinLocalBuffer.patch
- v6-0008-bufmgr-Remove-buffer-write-dirty-tracepoints.patch
- v6-0009-bufmgr-Acquire-and-clean-victim-buffer-separately.patch
- v6-0010-bufmgr-Support-multiple-in-progress-IOs-by-using-.patch
- v6-0011-bufmgr-Move-relation-extension-handling-into-Exte.patch
- v6-0012-Convert-a-few-places-to-ExtendBufferedRel.patch
- v6-0013-heapam-Add-num_pages-to-RelationGetBufferForTuple.patch
- v6-0014-hio-Use-ExtendBufferedRelBy.patch
- v6-0015-WIP-Don-t-initialize-page-in-vm-fsm-_extend-not-n.patch
- v6-0016-Convert-a-few-places-to-ExtendBufferedRelTo.patch
On Thu, Mar 30, 2023 at 10:02 AM Andres Freund <andres@anarazel.de> wrote:
> Attached is v6. Changes:
0006:
+ ereport(ERROR,
+ errcode_for_file_access(),
+ errmsg("could not extend file \"%s\" with posix_fallocate(): %m",
+ FilePathName(v->mdfd_vfd)),
+ errhint("Check free disk space."));
Portability nit: mdzeroextend() doesn't know whether posix_fallocate() was used in FileFallocate().
--
John Naylor
EDB: http://www.enterprisedb.com
Hi, On 2023-03-30 12:28:57 +0700, John Naylor wrote: > On Thu, Mar 30, 2023 at 10:02 AM Andres Freund <andres@anarazel.de> wrote: > > Attached is v6. Changes: > > 0006: > > + ereport(ERROR, > + errcode_for_file_access(), > + errmsg("could not extend file \"%s\" with posix_fallocate(): > %m", > + FilePathName(v->mdfd_vfd)), > + errhint("Check free disk space.")); > > Portability nit: mdzeroextend() doesn't know whether posix_fallocate() was > used in FileFallocate(). Fair point. I would however like to see a different error message for the two ways of extending, at least initially. What about referencing FileFallocate()? Greetings, Andres Freund
Hi, On 2023-03-29 20:02:33 -0700, Andres Freund wrote: > Attached is v6. Changes: Attached is v7. Not much in the way of changes:

- polished a lot of the commit messages
- reordered commits to not be blocked as immediately by https://www.postgresql.org/message-id/20230325025740.wzvchp2kromw4zqz%40awork3.anarazel.de
- Used the new relation extension function in two more places (finding an independent bug on the way), not sure why I didn't convert those earlier...

I'm planning to push the patches up to the hio.c changes soon, unless somebody would like me to hold off. After that I'm planning to wait for a buildfarm cycle, and push the changes necessary to use bulk extension in hio.c (the main win).

I might split the patch to use ExtendBufferedRelTo() into two, one for vm_extend() and fsm_extend(), and one for xlogutils.c. The latter is more complicated and has more of a complicated history (see [1]).

Greetings, Andres Freund

[1] https://www.postgresql.org/message-id/20230223010147.32oir7sb66slqnjk%40awork3.anarazel.de
Attachment
- v7-0001-Don-t-initialize-page-in-vm-fsm-_extend-not-neede.patch
- v7-0002-Add-smgrzeroextend-FileZero-FileFallocate.patch
- v7-0003-bufmgr-Add-some-more-error-checking-infrastructur.patch
- v7-0004-bufmgr-Add-Pin-UnpinLocalBuffer.patch
- v7-0005-bufmgr-Acquire-and-clean-victim-buffer-separately.patch
- v7-0006-bufmgr-Support-multiple-in-progress-IOs-by-using-.patch
- v7-0007-bufmgr-Introduce-infrastructure-for-faster-relati.patch
- v7-0008-Convert-many-uses-of-ReadBuffer-Extended-P_NEW-wi.patch
- v7-0009-heapam-Pass-number-of-required-pages-to-RelationG.patch
- v7-0010-hio-Relax-rules-for-calling-GetVisibilityMapPins.patch
- v7-0011-hio-Don-t-pin-the-VM-while-holding-buffer-lock-wh.patch
- v7-0012-hio-Use-ExtendBufferedRelBy.patch
- v7-0013-Convert-a-few-places-to-ExtendBufferedRelTo.patch
On Wed, Apr 5, 2023 at 7:32 AM Andres Freund <andres@anarazel.de> wrote:
> > Portability nit: mdzeroextend() doesn't know whether posix_fallocate() was
> > used in FileFallocate().
>
> Fair point. I would however like to see a different error message for the two
> ways of extending, at least initially. What about referencing FileFallocate()?
Seems logical.
--
John Naylor
EDB: http://www.enterprisedb.com
Hi, On 2023-04-04 17:39:45 -0700, Andres Freund wrote: > I'm planning to push the patches up to the hio.c changes soon, unless somebody > would like me to hold off. Done that. > After that I'm planning to wait for a buildfarm cycle, and push the changes > necessary to use bulk extension in hio.c (the main win). Working on that. Might end up being tomorrow. > I might split the patch to use ExtendBufferedRelTo() into two, one for > vm_extend() and fsm_extend(), and one for xlogutils.c. The latter is more > complicated and has more of a complicated history (see [1]). I've pushed the vm_extend() and fsm_extend() piece, and did split out the xlogutils.c case. Greetings, Andres Freund
Hi, On 2023-04-05 18:46:16 -0700, Andres Freund wrote: > On 2023-04-04 17:39:45 -0700, Andres Freund wrote: > > After that I'm planning to wait for a buildfarm cycle, and push the changes > > necessary to use bulk extension in hio.c (the main win). > > Working on that. Might end up being tomorrow. It did. So far no complaints from the buildfarm. Although I pushed the last piece just now... Besides editing the commit message, not a lot of changes between what I posted last and what I pushed. A few typos and awkward sentences, code formatting, etc. I did change the API of RelationAddBlocks() to set *did_unlock = false, if it didn't unlock (and thus removed setting it in the caller). > > I might split the patch to use ExtendBufferedRelTo() into two, one for > > vm_extend() and fsm_extend(), and one for xlogutils.c. The latter is more > > complicated and has more of a complicated history (see [1]). > > I've pushed the vm_extend() and fsm_extend() piece, and did split out the > xlogutils.c case. Which I pushed, just now. I did perform manual testing with creating disconnected segments on the standby, and checking that everything behaves well in that case. I think it might be worth having a C test for some of the bufmgr.c API. Things like testing that retrying a failed relation extension works the second time round. Greetings, Andres Freund
Hi, On 2023-04-06 18:15:14 -0700, Andres Freund wrote: > I think it might be worth having a C test for some of the bufmgr.c API. Things > like testing that retrying a failed relation extension works the second time > round. A few hours after this I hit a stupid copy-pasto (21d7c05a5cf) that would hopefully have been uncovered by such a test... I guess we could even test this specific instance without a more complicated framework. Create a table with some data, rename the file, checkpoint - should fail, rename back, checkpoint - should succeed. It's much harder to exercise the error paths inside the backend extending the relation unfortunately, because we require the file to be opened rw before doing much. And once the FD is open, removing the permissions doesn't help. The least complicated approach I can see is creating directory quotas, but that's quite filesystem-specific... Greetings, Andres Freund
Hi Andres, 07.04.2023 11:39, Andres Freund wrote: > Hi, > > On 2023-04-06 18:15:14 -0700, Andres Freund wrote: >> I think it might be worth having a C test for some of the bufmgr.c API. Things >> like testing that retrying a failed relation extension works the second time >> round. > A few hours after this I hit a stupid copy-pasto (21d7c05a5cf) that would > hopefully have been uncovered by such a test... A few days later I've found a new defect introduced with 31966b151. The following script:

echo "
CREATE TABLE tbl(id int);
INSERT INTO tbl(id) SELECT i FROM generate_series(1, 1000) i;
DELETE FROM tbl;
CHECKPOINT;
" | psql -q
sleep 2
grep -C2 'automatic vacuum of table ".*.tbl"' server.log
tf=$(psql -Aqt -c "SELECT format('%s/%s', pg_database.oid, relfilenode) FROM pg_database, pg_class WHERE datname = current_database() AND relname = 'tbl'")
ls -l "$PGDB/base/$tf"
pg_ctl -D $PGDB stop -m immediate
pg_ctl -D $PGDB -l server.log start

with the autovacuum enabled as follows:

autovacuum = on
log_autovacuum_min_duration = 0
autovacuum_naptime = 1

gives:

2023-04-11 20:56:56.261 MSK [675708] LOG: checkpoint starting: immediate force wait
2023-04-11 20:56:56.324 MSK [675708] LOG: checkpoint complete: wrote 900 buffers (5.5%); 0 WAL file(s) added, 0 removed, 0 recycled; write=0.016 s, sync=0.034 s, total=0.063 s; sync files=252, longest=0.017 s, average=0.001 s; distance=4162 kB, estimate=4162 kB; lsn=0/1898588, redo lsn=0/1898550
2023-04-11 20:56:57.558 MSK [676060] LOG: automatic vacuum of table "testdb.public.tbl": index scans: 0 pages: 5 removed, 0 remain, 5 scanned (100.00% of total) tuples: 1000 removed, 0 remain, 0 are dead but not yet removable
-rw------- 1 law law 0 апр 11 20:56 .../tmpdb/base/16384/16385
waiting for server to shut down.... done
server stopped
waiting for server to start.... stopped waiting
pg_ctl: could not start server

Examine the log output. 
server stops with the following stack trace:

Core was generated by `postgres: startup recovering 000000010000000000000001 '.
Program terminated with signal SIGABRT, Aborted.
warning: Section `.reg-xstate/790626' in core file too small.
#0 __pthread_kill_implementation (no_tid=0, signo=6, threadid=140209454906240) at ./nptl/pthread_kill.c:44
44 ./nptl/pthread_kill.c: No such file or directory.
(gdb) bt
#0 __pthread_kill_implementation (no_tid=0, signo=6, threadid=140209454906240) at ./nptl/pthread_kill.c:44
#1 __pthread_kill_internal (signo=6, threadid=140209454906240) at ./nptl/pthread_kill.c:78
#2 __GI___pthread_kill (threadid=140209454906240, signo=signo@entry=6) at ./nptl/pthread_kill.c:89
#3 0x00007f850ec53476 in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26
#4 0x00007f850ec397f3 in __GI_abort () at ./stdlib/abort.c:79
#5 0x0000557950889c0b in ExceptionalCondition (conditionName=0x557950a67680 "mode == RBM_NORMAL || mode == RBM_ZERO_AND_LOCK || mode == RBM_ZERO_ON_ERROR", fileName=0x557950a673e8 "bufmgr.c", lineNumber=1008) at assert.c:66
#6 0x000055795064f2d0 in ReadBuffer_common (smgr=0x557952739f38, relpersistence=112 'p', forkNum=MAIN_FORKNUM, blockNum=4294967295, mode=RBM_ZERO_AND_CLEANUP_LOCK, strategy=0x0, hit=0x7fff22dd648f) at bufmgr.c:1008
#7 0x000055795064ebe7 in ReadBufferWithoutRelcache (rlocator=..., forkNum=MAIN_FORKNUM, blockNum=4294967295, mode=RBM_ZERO_AND_CLEANUP_LOCK, strategy=0x0, permanent=true) at bufmgr.c:800
#8 0x000055795021c0fa in XLogReadBufferExtended (rlocator=..., forknum=MAIN_FORKNUM, blkno=0, mode=RBM_ZERO_AND_CLEANUP_LOCK, recent_buffer=0) at xlogutils.c:536
#9 0x000055795021bd92 in XLogReadBufferForRedoExtended (record=0x5579526c4998, block_id=0 '\000', mode=RBM_NORMAL, get_cleanup_lock=true, buf=0x7fff22dd6598) at xlogutils.c:391
#10 0x00005579501783b1 in heap_xlog_prune (record=0x5579526c4998) at heapam.c:8726
#11 0x000055795017b7db in heap2_redo (record=0x5579526c4998) at heapam.c:9960
#12 0x0000557950215b34 in ApplyWalRecord (xlogreader=0x5579526c4998, record=0x7f85053d0120, replayTLI=0x7fff22dd6720) at xlogrecovery.c:1915
#13 0x0000557950215611 in PerformWalRecovery () at xlogrecovery.c:1746
#14 0x0000557950201ce3 in StartupXLOG () at xlog.c:5433
#15 0x00005579505cb6d2 in StartupProcessMain () at startup.c:267
#16 0x00005579505be9f7 in AuxiliaryProcessMain (auxtype=StartupProcess) at auxprocess.c:141
#17 0x00005579505ca2b5 in StartChildProcess (type=StartupProcess) at postmaster.c:5369
#18 0x00005579505c5224 in PostmasterMain (argc=3, argv=0x5579526c3e70) at postmaster.c:1455
#19 0x000055795047a97d in main (argc=3, argv=0x5579526c3e70) at main.c:200

As I can see, autovacuum removes pages from the table file, and this causes the crash while replaying the record:

rmgr: Heap2 len (rec/tot): 60/ 988, tx: 0, lsn: 0/01898600, prev 0/01898588, desc: PRUNE snapshotConflictHorizon 732 nredirected 0 ndead 226, blkref #0: rel 1663/16384/16385 blk 0 FPW

Best regards, Alexander
Hi, On 2023-04-11 22:00:00 +0300, Alexander Lakhin wrote: > A few days later I've found a new defect introduced with 31966b151. That's the same issue that Tom also just reported, at https://postgr.es/m/392271.1681238924%40sss.pgh.pa.us Attached is my WIP fix, including a test. Greetings, Andres Freund
Attachment
12.04.2023 02:21, Andres Freund wrote: > Hi, > > On 2023-04-11 22:00:00 +0300, Alexander Lakhin wrote: >> A few days later I've found a new defect introduced with 31966b151. > That's the same issue that Tom also just reported, at > https://postgr.es/m/392271.1681238924%40sss.pgh.pa.us > > Attached is my WIP fix, including a test. Thanks for the fix. I can confirm that the issue is gone. ReadBuffer_common() contains an Assert(), that is similar to the fixed one, but it looks unreachable for the WAL replay case after 26158b852. Best regards, Alexander
Hi, On 2023-04-12 08:00:00 +0300, Alexander Lakhin wrote: > 12.04.2023 02:21, Andres Freund wrote: > > Hi, > > > > On 2023-04-11 22:00:00 +0300, Alexander Lakhin wrote: > > > A few days later I've found a new defect introduced with 31966b151. > > That's the same issue that Tom also just reported, at > > https://postgr.es/m/392271.1681238924%40sss.pgh.pa.us > > > > Attached is my WIP fix, including a test. > > Thanks for the fix. I can confirm that the issue is gone. > ReadBuffer_common() contains an Assert(), that is similar to the fixed one, > but it looks unreachable for the WAL replay case after 26158b852. Good catch. I implemented it there too. As now all of the modes are supported, I removed the assertion. I also extended the test slightly to also test the case of dropped relations. Greetings, Andres Freund
Hi,
Could you please share repro steps for running these benchmarks? I am doing performance testing in this area and want to use the same benchmarks.
Thanks,
Muhammad
From: Andres Freund <andres@anarazel.de>
Sent: Friday, October 28, 2022 7:54 PM
To: pgsql-hackers@postgresql.org <pgsql-hackers@postgresql.org>; Thomas Munro <thomas.munro@gmail.com>; Melanie Plageman <melanieplageman@gmail.com>
Cc: Yura Sokolov <y.sokolov@postgrespro.ru>; Robert Haas <robertmhaas@gmail.com>
Subject: refactoring relation extension and BufferAlloc(), faster COPY
Hi,
I'm working to extract independently useful bits from my AIO work, to reduce
the size of that patchset. This is one of those pieces.
In workloads that extend relations a lot, we end up being extremely contended
on the relation extension lock. We've attempted to address that to some degree
by using batching, which helps, but only so much.
The fundamental issue, in my opinion, is that we do *way* too much while
holding the relation extension lock. We acquire a victim buffer, if that
buffer is dirty, we potentially flush the WAL, then write out that
buffer. Then we zero out the buffer contents. Call smgrextend().
Most of that work does not actually need to happen while holding the relation
extension lock. As far as I can tell, the minimum that needs to be covered by
the extension lock is the following:
1) call smgrnblocks()
2) insert buffer[s] into the buffer mapping table at the location returned by
smgrnblocks
3) mark buffer[s] as IO_IN_PROGRESS
1) obviously has to happen with the relation extension lock held because
otherwise we might miss another relation extension. 2+3) need to happen with
the lock held, because otherwise another backend not doing an extension could
read the block before we're done extending, dirty it, write it out, and then
have it overwritten by the extending backend.
The reason we currently do so much work while holding the relation extension
lock is that bufmgr.c does not know about the relation lock and that relation
extension happens entirely within ReadBuffer* - there's no way to use a
narrower scope for the lock.
My fix for that is to add a dedicated function for extending relations, that
can acquire the extension lock if necessary (callers can tell it to skip that,
e.g., when initially creating an init fork). This routine is called by
ReadBuffer_common() when P_NEW is passed in, to provide backward
compatibility.
To be able to acquire victim buffers outside of the extension lock, victim
buffers are now acquired separately from inserting the new buffer mapping
entry. Victim buffers are pinned, cleaned, removed from the buffer mapping
table and marked invalid. Because they are pinned, clock sweeps in other
backends won't return them. This is done in a new function,
[Local]BufferAlloc().
This is similar to Yura's patch at [0], but not that similar to earlier or
later approaches in that thread. I don't really understand why that thread
went on to ever more complicated approaches, when the basic approach shows
plenty of gains, with no issues around the number of buffer mapping entries
that can exist etc.
Other interesting bits I found:
a) For workloads that [mostly] fit into s_b, the smgrwrite() that BufferAlloc()
does nearly doubles the amount of writes: first the kernel ends up writing
out all the zeroed-out buffers after a while, then we write out the
actual buffer contents a second time.
The best fix for that seems to be to optionally use posix_fallocate() to
reserve space, without dirtying pages in the kernel page cache. However, it
looks like that's only beneficial when extending by multiple pages at once,
because it ends up causing one filesystem-journal entry for each extension
on at least some filesystems.
I added 'smgrzeroextend()' that can extend by multiple blocks, without the
caller providing a buffer to write out. When extending by 8 or more blocks,
posix_fallocate() is used if available, otherwise pg_pwritev_with_retry() is
used to extend the file.
b) I found that it is quite beneficial to bulk-extend the relation with
smgrextend() even without concurrency. The reason is primarily the
aforementioned dirty buffers that our current extension method causes.
One bit that stumped me for quite a while was knowing how much to extend the
relation by. RelationGetBufferForTuple() bases the decision whether / how
much to bulk extend purely on the contention on the extension lock, which
obviously does not work for non-concurrent workloads.
After quite a while I figured out that we actually have good information on
how much to extend by, at least for COPY /
heap_multi_insert(). heap_multi_insert() can compute how much space is
needed to store all tuples, and pass that on to
RelationGetBufferForTuple().
For that to be accurate we need to recompute that number whenever we use an
already partially filled page. That's not great, but doesn't appear to be a
measurable overhead.
c) Contention on the FSM and the pages returned by it is a serious bottleneck
after a) and b).
The biggest issue is that the current bulk insertion logic in hio.c enters
all but one of the new pages into the freespacemap. That will immediately
cause all the other backends to contend on the first few pages returned by
the FSM, and cause contention on the FSM pages themselves.
I've partially addressed that by using the information about the required
number of pages from b). Whether we bulk insert or not, the pages we know
we're going to need for one heap_multi_insert() don't need to be added to
the FSM - we're going to use them anyway.
I've stashed the number of free blocks in the BulkInsertState for now, but
I'm not convinced that that's the right place.
If I revert just this part, the "concurrent COPY into unlogged table"
benchmark goes from ~240 tps to ~190 tps.
Even after that change the FSM is a major bottleneck. Below I included
benchmarks showing this by just removing the use of the FSM, but I haven't
done anything further about it. The contention seems to be both from
updating the FSM, as well as thundering-herd like symptoms from accessing
the FSM.
The update part could likely be addressed to some degree with a batch
update operation updating the state for multiple pages.
The access part could perhaps be addressed by adding an operation that gets
a page and immediately marks it as fully used, so other backends won't also
try to access it.
d) doing
/* new buffers are zero-filled */
MemSet((char *) bufBlock, 0, BLCKSZ);
under the extension lock is surprisingly expensive on my two-socket
workstation (but much less noticeable on my laptop).
If I move the MemSet back under the extension lock, the "concurrent COPY
into unlogged table" benchmark goes from ~240 tps to ~200 tps.
e) When running a few benchmarks for this email, I noticed that there was a
sharp performance dropoff for the patched code for a pgbench -S -s100 on a
database with 1GB s_b, starting between 512 and 1024 clients. This started
with the patch only acquiring one buffer partition lock at a time. Lots of
debugging ensued, resulting in [3].
The problem isn't actually related to the change, it just makes it more
visible, because the "lock chains" between two partitions reduce the
average length of the wait queues substantially, by distributing them
between more partitions. [3] has a reproducer that's entirely independent
of this patchset.
Bulk extension acquires a number of victim buffers, acquires the extension
lock, inserts the buffers into the buffer mapping table and marks them as
io-in-progress, calls smgrextend and releases the extension lock. After that
buffer[s] are locked (depending on mode and an argument indicating the number
of blocks to be locked), and TerminateBufferIO() is called.
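Condensed into pseudocode, the bulk-extension sequence described above looks roughly like this (names approximate the patch's, and may not match it exactly):

```
ExtendRelationBuffered(rel, nblocks):
    victims = acquire and pin nblocks clean, invalidated victim buffers
    LockRelationForExtension(rel)        # unless the caller opted out
    first = smgrnblocks(rel)
    for i in 0 .. nblocks-1:
        insert victims[i] into buffer mapping table at block first+i
        mark victims[i] IO_IN_PROGRESS
    smgrextend / smgrzeroextend(rel, first, nblocks)
    UnlockRelationForExtension(rel)
    lock requested buffer[s], TerminateBufferIO() on each
```

Only the smgrnblocks() call, the mapping-table inserts, and marking the buffers in-progress happen under the extension lock, matching points 1)-3) above.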
This requires two new pieces of infrastructure:
First, pinning multiple buffers opens up the obvious danger that we might run
out of non-pinned buffers. I added LimitAdditional[Local]Pins() that allows each
backend to pin a proportional share of buffers (although always allowing one,
as we do today).
Second, having multiple IOs in progress at the same time isn't possible with
the InProgressBuf mechanism. I added ResourceOwnerRememberBufferIO() etc. to
deal with that instead. I like that this ends up removing a lot of
AbortBufferIO() calls from the loops of various aux processes (now released
inside ReleaseAuxProcessResources()).
In very extreme workloads (single backend doing a pgbench -S -s 100 against a
s_b=64MB database) the memory allocations triggered by StartBufferIO() are
*just about* visible; not sure if that's worth worrying about - we do such
allocations for the much more common pinning of buffers as well.
The new [Bulk]ExtendSharedRelationBuffered() currently have both a Relation
and a SMgrRelation argument, requiring at least one of them to be set. The
reason for that is on the one hand that LockRelationForExtension() requires a
relation and on the other hand, redo routines typically don't have a Relation
around (recovery doesn't require an extension lock). That's not pretty, but
seems a tad better than the ReadBufferExtended() vs
ReadBufferWithoutRelcache() mess.
I've done a fair bit of benchmarking of this patchset. For COPY it comes out
ahead everywhere. It's possible that there's a very small regression for
extremely IO-miss-heavy workloads, more below.
server "base" configuration:
max_wal_size=150GB
shared_buffers=24GB
huge_pages=on
autovacuum=0
backend_flush_after=2MB
max_connections=5000
wal_buffers=128MB
wal_segment_size=1GB
benchmark: pgbench running COPY into a single table. pgbench -t is set
according to the client count, so that the same amount of data is inserted.
This is done both using small files ([1], ringbuffer not effective, no dirty
data to write out within the benchmark window) and a bit larger files ([2],
lots of data to write out due to ringbuffer).
To make it a fair comparison HEAD includes the lwlock-waitqueue fix as well.
s_b=24GB
test: unlogged_small_files, format: text, files: 1024, 9015MB total
seconds tbl-MBs seconds tbl-MBs seconds tbl-MBs
clients HEAD HEAD patch patch no_fsm no_fsm
1 58.63 207 50.22 242 54.35 224
2 32.67 372 25.82 472 27.30 446
4 22.53 540 13.30 916 14.33 851
8 15.14 804 7.43 1640 7.48 1632
16 14.69 829 4.79 2544 4.50 2718
32 15.28 797 4.41 2763 3.32 3710
64 15.34 794 5.22 2334 3.06 4061
128 15.49 786 4.97 2452 3.13 3926
256 15.85 768 5.02 2427 3.26 3769
512 16.02 760 5.29 2303 3.54 3471
test: logged_small_files, format: text, files: 1024, 9018MB total
seconds tbl-MBs seconds tbl-MBs seconds tbl-MBs
clients HEAD HEAD patch patch no_fsm no_fsm
1 68.18 178 59.41 205 63.43 192
2 39.71 306 33.10 368 34.99 348
4 27.26 446 19.75 617 20.09 607
8 18.84 646 12.86 947 12.68 962
16 15.96 763 9.62 1266 8.51 1436
32 15.43 789 8.20 1486 7.77 1579
64 16.11 756 8.91 1367 8.90 1383
128 16.41 742 10.00 1218 9.74 1269
256 17.33 702 11.91 1023 10.89 1136
512 18.46 659 14.07 866 11.82 1049
test: unlogged_medium_files, format: text, files: 64, 9018MB total
seconds tbl-MBs seconds tbl-MBs seconds tbl-MBs
clients HEAD HEAD patch patch no_fsm no_fsm
1 63.27 192 56.14 217 59.25 205
2 40.17 303 29.88 407 31.50 386
4 27.57 442 16.16 754 17.18 709
8 21.26 573 11.89 1025 11.09 1099
16 21.25 573 10.68 1141 10.22 1192
32 21.00 580 10.72 1136 10.35 1178
64 20.64 590 10.15 1200 9.76 1249
128 skipped
256 skipped
512 skipped
test: logged_medium_files, format: text, files: 64, 9018MB total
seconds tbl-MBs seconds tbl-MBs seconds tbl-MBs
clients HEAD HEAD patch patch no_fsm no_fsm
1 71.89 169 65.57 217 69.09 69.09
2 47.36 257 36.22 407 38.71 38.71
4 33.10 368 21.76 754 22.78 22.78
8 26.62 457 15.89 1025 15.30 15.30
16 24.89 489 17.08 1141 15.20 15.20
32 25.15 484 17.41 1136 16.14 16.14
64 26.11 466 17.89 1200 16.76 16.76
128 skipped
256 skipped
512 skipped
Just to see how far it can be pushed, with binary format we can now get to
nearly 6GB/s into a table when disabling the FSM - note the 2x difference
between patch and patch+no-fsm at 32 clients.
test: unlogged_small_files, format: binary, files: 1024, 9508MB total
seconds tbl-MBs seconds tbl-MBs seconds tbl-MBs
clients HEAD HEAD patch patch no_fsm no_fsm
1 34.14 357 28.04 434 29.46 413
2 22.67 537 14.42 845 14.75 826
4 16.63 732 7.62 1599 7.69 1587
8 13.48 904 4.36 2795 4.13 2959
16 14.37 848 3.78 3224 2.74 4493
32 14.79 823 4.20 2902 2.07 5974
64 14.76 825 5.03 2423 2.21 5561
128 14.95 815 4.36 2796 2.30 5343
256 15.18 802 4.31 2828 2.49 4935
512 15.41 790 4.59 2656 2.84 4327
s_b=4GB
test: unlogged_small_files, format: text, files: 1024, 9018MB total
seconds tbl-MBs seconds tbl-MBs
clients HEAD HEAD patch patch
1 62.55 194 54.22 224
2 37.11 328 28.94 421
4 25.97 469 16.42 742
8 20.01 609 11.92 1022
16 19.55 623 11.05 1102
32 19.34 630 11.27 1081
64 19.07 639 12.04 1012
128 19.22 634 12.27 993
256 19.34 630 12.28 992
512 19.60 621 11.74 1038
test: logged_small_files, format: text, files: 1024, 9018MB total
seconds tbl-MBs seconds tbl-MBs
clients HEAD HEAD patch patch
1 71.71 169 63.63 191
2 46.93 259 36.31 335
4 30.37 401 22.41 543
8 22.86 533 16.90 721
16 20.18 604 14.07 866
32 19.64 620 13.06 933
64 19.71 618 15.08 808
128 19.95 610 15.47 787
256 20.48 595 16.53 737
512 21.56 565 16.86 722
test: unlogged_medium_files, format: text, files: 64, 9018MB total
seconds tbl-MBs seconds tbl-MBs
clients HEAD HEAD patch patch
1 62.65 194 55.74 218
2 40.25 302 29.45 413
4 27.37 445 16.26 749
8 22.07 552 11.75 1037
16 21.29 572 10.64 1145
32 20.98 580 10.70 1139
64 20.65 590 10.21 1193
128 skipped
256 skipped
512 skipped
test: logged_medium_files, format: text, files: 64, 9018MB total
seconds tbl-MBs seconds tbl-MBs
clients HEAD HEAD patch patch
1 71.72 169 65.12 187
2 46.46 262 35.74 341
4 32.61 373 21.60 564
8 26.69 456 16.30 747
16 25.31 481 17.00 716
32 24.96 488 17.47 697
64 26.05 467 17.90 680
128 skipped
256 skipped
512 skipped
test: unlogged_small_files, format: binary, files: 1024, 9505MB total
seconds tbl-MBs seconds tbl-MBs
clients HEAD HEAD patch patch
1 37.62 323 32.77 371
2 28.35 429 18.89 645
4 20.87 583 12.18 1000
8 19.37 629 10.38 1173
16 19.41 627 10.36 1176
32 18.62 654 11.04 1103
64 18.33 664 11.89 1024
128 18.41 661 11.91 1023
256 18.52 658 12.10 1007
512 18.78 648 11.49 1060
benchmark: Run a pgbench -S workload with scale 100, so it doesn't fit into
s_b, thereby exercising BufferAlloc()'s buffer replacement path heavily.
The run-to-run variance on my workstation is high for this workload (both
before/after my changes). I also found that the ramp-up time at higher client
counts is very significant:
progress: 2.1 s, 5816.8 tps, lat 1.835 ms stddev 4.450, 0 failed
progress: 3.0 s, 666729.4 tps, lat 5.755 ms stddev 16.753, 0 failed
progress: 4.0 s, 899260.1 tps, lat 3.619 ms stddev 41.108, 0 failed
...
One would need to run pgbench for impractically long to make that effect
vanish.
My not-great solution for this was to run with -T21 -P5 and use the best 5s
as the tps.
s_b=1GB
tps tps
clients master patched
1 49541 48805
2 85342 90010
4 167340 168918
8 308194 303222
16 524294 523678
32 649516 649100
64 932547 937702
128 908249 906281
256 856496 903979
512 764254 934702
1024 653886 925113
2048 569695 917262
4096 526782 903258
s_b=128MB:
tps tps
clients master patched
1 40407 39854
2 73180 72252
4 143334 140860
8 240982 245331
16 429265 420810
32 544593 540127
64 706408 726678
128 713142 718087
256 611030 695582
512 552751 686290
1024 508248 666370
2048 474108 656735
4096 448582 633040
As there might be a small regression at the smallest end, I ran a more extreme
version of the above: a pipelined pgbench -S, with a single client, for
longer, with s_b=8MB.
To further reduce noise I pinned the server to one cpu, the client to another
and disabled turbo mode on the CPU.
master "total" tps: 61.52
master "best 5s" tps: 61.8
patch "total" tps: 61.20
patch "best 5s" tps: 61.4
Hardly conclusive, but it does look like there's a small effect. It could be
code layout or such.
My guess however is that it's the resource owner for in-progress IO that I
added - that adds an additional allocation inside the resowner machinery. I
commented those out (that's obviously incorrect!) just to see whether that
changes anything:
no-resowner "total" tps: 62.03
no-resowner "best 5s" tps: 62.2
So it does look like it's the resowner. I am a bit surprised, because we
already use that mechanism for pins, which obviously are more
frequent.
I'm not sure it's worth worrying about - this is a pretty absurd workload. But
if we decide it is, I can think of a few ways to address this. E.g.:
- We could preallocate an initial element inside the ResourceArray struct, so
that a newly created resowner won't need to allocate immediately
- We could only use resowners if there's more than one IO in progress at the
same time - but I don't like that idea much
- We could try to store the "in-progress"-ness of a buffer inside the 'bufferpin'
resowner entry - on 64-bit systems there's plenty of space for that. But on 32-bit systems...
The patches here aren't fully polished (as will be evident). But they should
be more than good enough to discuss whether this is a sane direction.
Greetings,
Andres Freund
[0] https://postgr.es/m/3b108afd19fa52ed20c464a69f64d545e4a14772.camel%40postgrespro.ru
[1] COPY (SELECT repeat(random()::text, 5) FROM generate_series(1, 100000)) TO '/tmp/copytest_data_text.copy' WITH (FORMAT text);
[2] COPY (SELECT repeat(random()::text, 5) FROM generate_series(1, 6*100000)) TO '/tmp/copytest_data_text.copy' WITH (FORMAT text);
[3] https://postgr.es/m/20221027165914.2hofzp4cvutj6gin@awork3.anarazel.de
I'm working to extract independently useful bits from my AIO work, to reduce
the size of that patchset. This is one of those pieces.
In workloads that extend relations a lot, we end up being extremely contended
on the relation extension lock. We've attempted to address that to some degree
by using batching, which helps, but only so much.
The fundamental issue, in my opinion, is that we do *way* too much while
holding the relation extension lock. We acquire a victim buffer, if that
buffer is dirty, we potentially flush the WAL, then write out that
buffer. Then we zero out the buffer contents. Call smgrextend().
Most of that work does not actually need to happen while holding the relation
extension lock. As far as I can tell, the minimum that needs to be covered by
the extension lock is the following:
1) call smgrnblocks()
2) insert buffer[s] into the buffer mapping table at the location returned by
smgrnblocks
3) mark buffer[s] as IO_IN_PROGRESS
1) obviously has to happen with the relation extension lock held because
otherwise we might miss another relation extension. 2+3) need to happen with
the lock held, because otherwise another backend not doing an extension could
read the block before we're done extending, dirty it, write it out, and then
have it overwritten by the extending backend.
The reason we currently do so much work while holding the relation extension
lock is that bufmgr.c does not know about the relation lock and that relation
extension happens entirely within ReadBuffer* - there's no way to use a
narrower scope for the lock.
My fix for that is to add a dedicated function for extending relations, that
can acquire the extension lock if necessary (callers can tell it to skip that,
e.g., when initially creating an init fork). This routine is called by
ReadBuffer_common() when P_NEW is passed in, to provide backward
compatibility.
To be able to acquire victim buffers outside of the extension lock, victim
buffers are now acquired separately from inserting the new buffer mapping
entry. Victim buffer are pinned, cleaned, removed from the buffer mapping
table and marked invalid. Because they are pinned, clock sweeps in other
backends won't return them. This is done in a new function,
[Local]BufferAlloc().
This is similar to Yuri's patch at [0], but not that similar to earlier or
later approaches in that thread. I don't really understand why that thread
went on to ever more complicated approaches, when the basic approach shows
plenty gains, with no issues around the number of buffer mapping entries that
can exist etc.
Other interesting bits I found:
a) For workloads that [mostly] fit into s_b, the smgwrite() that BufferAlloc()
does, nearly doubles the amount of writes. First the kernel ends up writing
out all the zeroed out buffers after a while, then when we write out the
actual buffer contents.
The best fix for that seems to be to optionally use posix_fallocate() to
reserve space, without dirtying pages in the kernel page cache. However, it
looks like that's only beneficial when extending by multiple pages at once,
because it ends up causing one filesystem-journal entry for each extension
on at least some filesystems.
I added 'smgrzeroextend()' that can extend by multiple blocks, without the
caller providing a buffer to write out. When extending by 8 or more blocks,
posix_fallocate() is used if available, otherwise pg_pwritev_with_retry() is
used to extend the file.
b) I found that is quite beneficial to bulk-extend the relation with
smgrextend() even without concurrency. The reason for that is the primarily
the aforementioned dirty buffers that our current extension method causes.
One bit that stumped me for quite a while is to know how much to extend the
relation by. RelationGetBufferForTuple() drives the decision whether / how
much to bulk extend purely on the contention on the extension lock, which
obviously does not work for non-concurrent workloads.
After quite a while I figured out that we actually have good information on
how much to extend by, at least for COPY /
heap_multi_insert(). heap_multi_insert() can compute how much space is
needed to store all tuples, and pass that on to
RelationGetBufferForTuple().
For that to be accurate we need to recompute that number whenever we use an
already partially filled page. That's not great, but doesn't appear to be a
measurable overhead.
c) Contention on the FSM and the pages returned by it is a serious bottleneck
after a) and b).
The biggest issue is that the current bulk insertion logic in hio.c enters
all but one of the new pages into the freespacemap. That will immediately
cause all the other backends to contend on the first few pages returned the
FSM, and cause contention on the FSM pages itself.
I've, partially, addressed that by using the information about the required
number of pages from b). Whether we bulk insert or not, the number of pages
we know we're going to need for one heap_multi_insert() don't need to be
added to the FSM - we're going to use them anyway.
I've stashed the number of free blocks in the BulkInsertState for now, but
I'm not convinced that that's the right place.
If I revert just this part, the "concurrent COPY into unlogged table"
benchmark goes from ~240 tps to ~190 tps.
Even after that change the FSM is a major bottleneck. Below I included
benchmarks showing this by just removing the use of the FSM, but I haven't
done anything further about it. The contention seems to be both from
updating the FSM, as well as thundering-herd like symptoms from accessing
the FSM.
The update part could likely be addressed to some degree with a batch
update operation updating the state for multiple pages.
The access part could perhaps be addressed by adding an operation that gets
a page and immediately marks it as fully used, so other backends won't also
try to access it.
d) doing
/* new buffers are zero-filled */
MemSet((char *) bufBlock, 0, BLCKSZ);
under the extension lock is surprisingly expensive on my two socket
workstation (but much less noticable on my laptop).
If I move the MemSet back under the extension lock, the "concurrent COPY
into unlogged table" benchmark goes from ~240 tps to ~200 tps.
e) When running a few benchmarks for this email, I noticed that there was a
sharp performance dropoff for the patched code for a pgbench -S -s100 on a
database with 1GB s_b, start between 512 and 1024 clients. This started with
the patch only acquiring one buffer partition lock at a time. Lots of
debugging ensued, resulting in [3].
The problem isn't actually related to the change, it just makes it more
visible, because the "lock chains" between two partitions reduce the
average length of the wait queues substantially, by distribution them
between more partitions. [3] has a reproducer that's entirely independent
of this patchset.
Bulk extension acquires a number of victim buffers, acquires the extension
lock, inserts the buffers into the buffer mapping table and marks them as
io-in-progress, calls smgrextend and releases the extension lock. After that
buffer[s] are locked (depending on mode and an argument indicating the number
of blocks to be locked), and TerminateBufferIO() is called.
This requires two new pieces of infrastructure:
First, pinning multiple buffers opens up the obvious danger that we might run
of non-pinned buffers. I added LimitAdditional[Local]Pins() that allows each
backend to pin a proportional share of buffers (although always allowing one,
as we do today).
Second, having multiple IOs in progress at the same time isn't possible with
the InProgressBuf mechanism. I added a ResourceOwnerRememberBufferIO() etc to
deal with that instead. I like that this ends up removing a lot of
AbortBufferIO() calls from the loops of various aux processes (now released
inside ReleaseAuxProcessResources()).
In very extreme workloads (single backend doing a pgbench -S -s 100 against a
s_b=64MB database) the memory allocations triggered by StartBufferIO() are
*just about* visible, not sure if that's worth worrying about - we do such
allocations for the much more common pinning of buffers as well.
The new [Bulk]ExtendSharedRelationBuffered() currently have both a Relation
and a SMgrRelation argument, requiring at least one of them to be set. The
reason for that is on the one hand that LockRelationForExtension() requires a
relation and on the other hand, redo routines typically don't have a Relation
around (recovery doesn't require an extension lock). That's not pretty, but
seems a tad better than the ReadBufferExtended() vs
ReadBufferWithoutRelcache() mess.
I've done a fair bit of benchmarking of this patchset. For COPY it comes out
ahead everywhere. It's possible that there's a very small regression for
extremly IO miss heavy workloads, more below.
server "base" configuration:
max_wal_size=150GB
shared_buffers=24GB
huge_pages=on
autovacuum=0
backend_flush_after=2MB
max_connections=5000
wal_buffers=128MB
wal_segment_size=1GB
benchmark: pgbench running COPY into a single table. pgbench -t is set
according to the client count, so that the same amount of data is inserted.
This is done oth using small files ([1], ringbuffer not effective, no dirty
data to write out within the benchmark window) and a bit larger files ([2],
lots of data to write out due to ringbuffer).
To make it a fair comparison HEAD includes the lwlock-waitqueue fix as well.
s_b=24GB
test: unlogged_small_files, format: text, files: 1024, 9015MB total
seconds tbl-MBs seconds tbl-MBs seconds tbl-MBs
clients HEAD HEAD patch patch no_fsm no_fsm
1 58.63 207 50.22 242 54.35 224
2 32.67 372 25.82 472 27.30 446
4 22.53 540 13.30 916 14.33 851
8 15.14 804 7.43 1640 7.48 1632
16 14.69 829 4.79 2544 4.50 2718
32 15.28 797 4.41 2763 3.32 3710
64 15.34 794 5.22 2334 3.06 4061
128 15.49 786 4.97 2452 3.13 3926
256 15.85 768 5.02 2427 3.26 3769
512 16.02 760 5.29 2303 3.54 3471
test: logged_small_files, format: text, files: 1024, 9018MB total
seconds tbl-MBs seconds tbl-MBs seconds tbl-MBs
clients HEAD HEAD patch patch no_fsm no_fsm
1 68.18 178 59.41 205 63.43 192
2 39.71 306 33.10 368 34.99 348
4 27.26 446 19.75 617 20.09 607
8 18.84 646 12.86 947 12.68 962
16 15.96 763 9.62 1266 8.51 1436
32 15.43 789 8.20 1486 7.77 1579
64 16.11 756 8.91 1367 8.90 1383
128 16.41 742 10.00 1218 9.74 1269
256 17.33 702 11.91 1023 10.89 1136
512 18.46 659 14.07 866 11.82 1049
test: unlogged_medium_files, format: text, files: 64, 9018MB total
seconds tbl-MBs seconds tbl-MBs seconds tbl-MBs
clients HEAD HEAD patch patch no_fsm no_fsm
1 63.27s 192 56.14 217 59.25 205
2 40.17s 303 29.88 407 31.50 386
4 27.57s 442 16.16 754 17.18 709
8 21.26s 573 11.89 1025 11.09 1099
16 21.25s 573 10.68 1141 10.22 1192
32 21.00s 580 10.72 1136 10.35 1178
64 20.64s 590 10.15 1200 9.76 1249
128 skipped
256 skipped
512 skipped
test: logged_medium_files, format: text, files: 64, 9018MB total
seconds tbl-MBs seconds tbl-MBs seconds tbl-MBs
clients HEAD HEAD patch patch no_fsm no_fsm
1 71.89s 169 65.57 217 69.09 69.09
2 47.36s 257 36.22 407 38.71 38.71
4 33.10s 368 21.76 754 22.78 22.78
8 26.62s 457 15.89 1025 15.30 15.30
16 24.89s 489 17.08 1141 15.20 15.20
32 25.15s 484 17.41 1136 16.14 16.14
64 26.11s 466 17.89 1200 16.76 16.76
128 skipped
256 skipped
512 skipped
Just to see how far it can be pushed, with binary format we can now get to
nearly 6GB/s into a table when disabling the FSM - note the 2x difference
between patch and patch+no-fsm at 32 clients.
test: unlogged_small_files, format: binary, files: 1024, 9508MB total
seconds tbl-MBs seconds tbl-MBs seconds tbl-MBs
clients HEAD HEAD patch patch no_fsm no_fsm
1 34.14 357 28.04 434 29.46 413
2 22.67 537 14.42 845 14.75 826
4 16.63 732 7.62 1599 7.69 1587
8 13.48 904 4.36 2795 4.13 2959
16 14.37 848 3.78 3224 2.74 4493
32 14.79 823 4.20 2902 2.07 5974
64 14.76 825 5.03 2423 2.21 5561
128 14.95 815 4.36 2796 2.30 5343
256 15.18 802 4.31 2828 2.49 4935
512 15.41 790 4.59 2656 2.84 4327
s_b=4GB
test: unlogged_small_files, format: text, files: 1024, 9018MB total
seconds tbl-MBs seconds tbl-MBs
clients HEAD HEAD patch patch
1 62.55 194 54.22 224
2 37.11 328 28.94 421
4 25.97 469 16.42 742
8 20.01 609 11.92 1022
16 19.55 623 11.05 1102
32 19.34 630 11.27 1081
64 19.07 639 12.04 1012
128 19.22 634 12.27 993
256 19.34 630 12.28 992
512 19.60 621 11.74 1038
test: logged_small_files, format: text, files: 1024, 9018MB total
seconds tbl-MBs seconds tbl-MBs
clients HEAD HEAD patch patch
1 71.71 169 63.63 191
2 46.93 259 36.31 335
4 30.37 401 22.41 543
8 22.86 533 16.90 721
16 20.18 604 14.07 866
32 19.64 620 13.06 933
64 19.71 618 15.08 808
128 19.95 610 15.47 787
256 20.48 595 16.53 737
512 21.56 565 16.86 722
test: unlogged_medium_files, format: text, files: 64, 9018MB total
seconds tbl-MBs seconds tbl-MBs
clients HEAD HEAD patch patch
1 62.65 194 55.74 218
2 40.25 302 29.45 413
4 27.37 445 16.26 749
8 22.07 552 11.75 1037
16 21.29 572 10.64 1145
32 20.98 580 10.70 1139
64 20.65 590 10.21 1193
128 skipped
256 skipped
512 skipped
test: logged_medium_files, format: text, files: 64, 9018MB total
seconds tbl-MBs seconds tbl-MBs
clients HEAD HEAD patch patch
1 71.72 169 65.12 187
2 46.46 262 35.74 341
4 32.61 373 21.60 564
8 26.69 456 16.30 747
16 25.31 481 17.00 716
32 24.96 488 17.47 697
64 26.05 467 17.90 680
128 skipped
256 skipped
512 skipped
test: unlogged_small_files, format: binary, files: 1024, 9505MB total
seconds tbl-MBs seconds tbl-MBs
clients HEAD HEAD patch patch
1 37.62 323 32.77 371
2 28.35 429 18.89 645
4 20.87 583 12.18 1000
8 19.37 629 10.38 1173
16 19.41 627 10.36 1176
32 18.62 654 11.04 1103
64 18.33 664 11.89 1024
128 18.41 661 11.91 1023
256 18.52 658 12.10 1007
512 18.78 648 11.49 1060
benchmark: Run a pgbench -S workload with scale 100, so it doesn't fit into
s_b, thereby exercising BufferAlloc()'s buffer replacement path heavily.
The run-to-run variance on my workstation is high for this workload (both
before/after my changes). I also found that the ramp-up time at higher client
counts is very significant:
progress: 2.1 s, 5816.8 tps, lat 1.835 ms stddev 4.450, 0 failed
progress: 3.0 s, 666729.4 tps, lat 5.755 ms stddev 16.753, 0 failed
progress: 4.0 s, 899260.1 tps, lat 3.619 ms stddev 41.108, 0 failed
...
One would need to run pgbench for impractically long to make that effect
vanish.
My not great solution for these was to run with -T21 -P5 and use the best 5s
as the tps.
s_b=1GB
tps tps
clients master patched
1 49541 48805
2 85342 90010
4 167340 168918
8 308194 303222
16 524294 523678
32 649516 649100
64 932547 937702
128 908249 906281
256 856496 903979
512 764254 934702
1024 653886 925113
2048 569695 917262
4096 526782 903258
s_b=128MB:
tps tps
clients master patched
1 40407 39854
2 73180 72252
4 143334 140860
8 240982 245331
16 429265 420810
32 544593 540127
64 706408 726678
128 713142 718087
256 611030 695582
512 552751 686290
1024 508248 666370
2048 474108 656735
4096 448582 633040
As there might be a small regression at the smallest end, I ran a more extreme
version of the above. Using a pipelined pgbench -S, with a single client, for
longer. With s_b=8MB.
To further reduce noise I pinned the server to one cpu, the client to another
and disabled turbo mode on the CPU.
master "total" tps: 61.52
master "best 5s" tps: 61.8
patch "total" tps: 61.20
patch "best 5s" tps: 61.4
Hardly conclusive, but it does look like there's a small effect. It could be
code layout or such.
My guess however is that it's the resource owner for in-progress IO that I
added - that adds an additional allocation inside the resowner machinery. I
commented those out (that's obviously incorrect!) just to see whether that
changes anything:
no-resowner "total" tps: 62.03
no-resowner "best 5s" tps: 62.2
So it looks like indeed, it's the resowner. I am a bit surprised, because
obviously we already use that mechanism for pins, which obviously is more
frequent.
I'm not sure it's worth worrying about - this is a pretty absurd workload. But
if we decide it is, I can think of a few ways to address this. E.g.:
- We could preallocate an initial element inside the ResourceArray struct, so
that a newly created resowner won't need to allocate immediately
- We could only use resowners if there's more than one IO in progress at the
same time - but I don't like that idea much
- We could try to store the "in-progress"-ness of a buffer inside the 'bufferpin'
resowner entry - on 64-bit systems there's plenty of space for that. But on 32-bit systems...
The patches here aren't fully polished (as will be evident). But they should
be more than good enough to discuss whether this is a sane direction.
Greetings,
Andres Freund
[0] https://postgr.es/m/3b108afd19fa52ed20c464a69f64d545e4a14772.camel%40postgrespro.ru
[1] COPY (SELECT repeat(random()::text, 5) FROM generate_series(1, 100000)) TO '/tmp/copytest_data_text.copy' WITH (FORMAT text);
[2] COPY (SELECT repeat(random()::text, 5) FROM generate_series(1, 6*100000)) TO '/tmp/copytest_data_text.copy' WITH (FORMAT text);
[3] https://postgr.es/m/20221027165914.2hofzp4cvutj6gin@awork3.anarazel.de
Hi,

On 2023-05-03 18:25:51 +0000, Muhammad Malik wrote:
> Could you please share repro steps for running these benchmarks? I am doing
> performance testing in this area and want to use the same benchmarks.

The email should contain all the necessary information. What are you missing?

c=16;psql -c 'DROP TABLE IF EXISTS copytest_0; CREATE TABLE copytest_0(data text not null);' && time /srv/dev/build/m-opt/src/bin/pgbench/pgbench -n -P1 -c$c -j$c -t$((1024/$c)) -f ~/tmp/copy.sql && psql -c 'TRUNCATE copytest_0'

> I've done a fair bit of benchmarking of this patchset. For COPY it comes out
> ahead everywhere. It's possible that there's a very small regression for
> extremely IO miss heavy workloads, more below.
>
>
> server "base" configuration:
>
> max_wal_size=150GB
> shared_buffers=24GB
> huge_pages=on
> autovacuum=0
> backend_flush_after=2MB
> max_connections=5000
> wal_buffers=128MB
> wal_segment_size=1GB
>
> benchmark: pgbench running COPY into a single table. pgbench -t is set
> according to the client count, so that the same amount of data is inserted.
> This is done both using small files ([1], ringbuffer not effective, no dirty
> data to write out within the benchmark window) and a bit larger files ([2],
> lots of data to write out due to ringbuffer).

I use a script like:
c=16;psql -c 'DROP TABLE IF EXISTS copytest_0; CREATE TABLE copytest_0(data text not null);' && time /srv/dev/build/m-opt/src/bin/pgbench/pgbench -n -P1 -c$c -j$c -t$((1024/$c)) -f ~/tmp/copy.sql && psql -c 'TRUNCATE copytest_0'

> [1] COPY (SELECT repeat(random()::text, 5) FROM generate_series(1, 100000)) TO '/tmp/copytest_data_text.copy' WITH (FORMAT text);
> [2] COPY (SELECT repeat(random()::text, 5) FROM generate_series(1, 6*100000)) TO '/tmp/copytest_data_text.copy' WITH (FORMAT text);

Greetings,

Andres Freund
> I use a script like:
> c=16;psql -c 'DROP TABLE IF EXISTS copytest_0; CREATE TABLE copytest_0(data text not null);' && time /srv/dev/build/m-opt/src/bin/pgbench/pgbench -n -P1 -c$c -j$c -t$((1024/$c)) -f ~/tmp/copy.sql && psql -c 'TRUNCATE copytest_0'
> > [1] COPY (SELECT repeat(random()::text, 5) FROM generate_series(1, 100000)) TO '/tmp/copytest_data_text.copy' WITH (FORMAT text);
> > [2] COPY (SELECT repeat(random()::text, 5) FROM generate_series(1, 6*100000)) TO '/tmp/copytest_data_text.copy' WITH (FORMAT text);
When I ran this script it did not insert anything into the copytest_0 table. It only generated a single copytest_data_text.copy file of size 9.236MB.
Please help me understand how this is 'pgbench running COPY into a single table'. Also, what are the 'seconds' and 'tbl-MBs' metrics that were reported?
Thank you,
Muhammad
Hi,

On 2023-05-03 19:29:46 +0000, Muhammad Malik wrote:
> > I use a script like:
> > c=16;psql -c 'DROP TABLE IF EXISTS copytest_0; CREATE TABLE copytest_0(data text not null);' && time /srv/dev/build/m-opt/src/bin/pgbench/pgbench -n -P1 -c$c -j$c -t$((1024/$c)) -f ~/tmp/copy.sql && psql -c 'TRUNCATE copytest_0'
> >
> > > [1] COPY (SELECT repeat(random()::text, 5) FROM generate_series(1, 100000)) TO '/tmp/copytest_data_text.copy' WITH (FORMAT text);
> > > [2] COPY (SELECT repeat(random()::text, 5) FROM generate_series(1, 6*100000)) TO '/tmp/copytest_data_text.copy' WITH (FORMAT text);
>
> When I ran this script it did not insert anything into the copytest_0 table.
> It only generated a single copytest_data_text.copy file of size 9.236MB.
> Please help me understand how this is 'pgbench running COPY into a single
> table'.

That's the data generation for the file to be COPYed in. The script passed to
pgbench is just something like

COPY copytest_0 FROM '/tmp/copytest_data_text.copy';

or

COPY copytest_0 FROM '/tmp/copytest_data_binary.copy';

> Also what are the 'seconds' and 'tbl-MBs' metrics that were reported.

The total time for the N inserts (1024 for the small files, 64 for the larger
ones). "tbl-MBs" is the size of the resulting table, divided by that time,
i.e. a measure of throughput.

Greetings,

Andres Freund
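Putting the two mails together, an end-to-end repro might look like the sketch below. The hard-coded pgbench binary path from the original script is replaced by plain `pgbench` on PATH, and the copy.sql content and the tbl-MB query are my reconstruction of what's described, not copied from the thread:

```shell
# End-to-end sketch of the COPY benchmark (small-file variant [1]).
# Assumes a running server and pgbench/psql on PATH.

# 1) Generate the file to be COPYed in; this step inserts nothing.
psql -c "COPY (SELECT repeat(random()::text, 5) FROM generate_series(1, 100000)) TO '/tmp/copytest_data_text.copy' WITH (FORMAT text)"

# 2) The pgbench script just COPYs that file back into the table:
mkdir -p ~/tmp
cat > ~/tmp/copy.sql <<'EOF'
COPY copytest_0 FROM '/tmp/copytest_data_text.copy';
EOF

# 3) 1024 COPYs total, split across $c clients; "seconds" is the wall
#    clock time reported by `time`.
c=16
psql -c 'DROP TABLE IF EXISTS copytest_0; CREATE TABLE copytest_0(data text not null);'
time pgbench -n -P1 -c$c -j$c -t$((1024/$c)) -f ~/tmp/copy.sql

# 4) "tbl-MBs": resulting table size in MB, to be divided by the seconds.
psql -Atc "SELECT pg_table_size('copytest_0') / (1024 * 1024)"
```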