Thread: Re: strange perf regression with data checksums
Hi Tomas,

> While running some benchmarks comparing 17 and 18, I ran into a simple
> workload where 18 throughput drops by ~80%. After pulling my hair for a
> couple hours I realized the change that triggered this is 04bec894a04c,
> which set checksums on by default. Which is very bizarre, because the
> workload is read-only and fits into shared buffers.
>
> [...]
>
> But why would it depend on checksums at all? This read-only test should
> be entirely in-memory, so how come it's affected?

These are interesting results.

Just wanted to clarify: did you make sure that all the hint bits were set before executing the benchmark?

I'm not claiming that hint bits are necessarily the reason for the observed behavior, but when something is off with presumably read-only queries this is the first reason that comes to mind. At least we should make sure hint bits are excluded from the equation. If memory serves, VACUUM FULL and CHECKPOINT after filling the table and creating the index should do the trick.

--
Best regards,
Aleksander Alekseev
On 5/9/25 14:53, Aleksander Alekseev wrote:
> Hi Tomas,
>
>> While running some benchmarks comparing 17 and 18, I ran into a simple
>> workload where 18 throughput drops by ~80%. After pulling my hair for a
>> couple hours I realized the change that triggered this is 04bec894a04c,
>> which set checksums on by default. Which is very bizarre, because the
>> workload is read-only and fits into shared buffers.
>>
>> [...]
>>
>> But why would it depend on checksums at all? This read-only test should
>> be entirely in-memory, so how come it's affected?
>
> These are interesting results.
>
> Just wanted to clarify: did you make sure that all the hint bits were
> set before executing the benchmark?
>
> I'm not claiming that hint bits are necessarily the reason for the
> observed behavior but when something is off with presumably read-only
> queries this is the first reason that comes to mind. At least we
> should make sure hint bits are excluded from the equation. If memory
> serves, VACUUM FULL and CHECKPOINT after filling the table and
> creating the index should do the trick.

Good question. I haven't checked that explicitly, but it's a tiny data set (15MB) and I observed this even on long benchmarks with tens of millions of queries. So the hint bits should have been set.

Also, I should have mentioned the query does an index-only scan, and the pin/unpin calls are on index pages, not on the heap.

regards

--
Tomas Vondra
Hi,

> > I'm not claiming that hint bits are necessarily the reason for the
> > observed behavior but when something is off with presumably read-only
> > queries this is the first reason that comes to mind. At least we
> > should make sure hint bits are excluded from the equation. If memory
> > serves, VACUUM FULL and CHECKPOINT after filling the table and
> > creating the index should do the trick.
>
> Good question. I haven't checked that explicitly, but it's a tiny data
> set (15MB) and I observed this even on long benchmarks with tens of
> millions of queries. So the hint bits should have been set.
>
> Also, I should have mentioned the query does an index-only scan, and the
> pin/unpin calls are on index pages, not on the heap.

There is one more thing I would check. As I recall, perf shows only on-CPU time, while the backends may actually be sleeping on locks most of the time. If that is the case, perf will not show you an accurate picture.

To check this myself, I create gdb.script with a single GDB command:

```
bt
```

And execute:

```
gdb --batch --command=gdb.script -p (backend_pid_here)
```

... 10+ times or so. If what you are observing is actually lock contention and the backend sleeps on a lock most of the time, 8/10 or so stacktraces will show you this.

I assume, of course, that the benchmark is done on release builds with Asserts disabled, etc.

BTW, do you believe this is a problem related exclusively to NUMA CPUs with 90+ cores, or can I reproduce it on SMT as well?

--
Best regards,
Aleksander Alekseev
On Fri, May 9, 2025 at 9:06 AM Tomas Vondra <tomas@vondra.me> wrote:
> Good question. I haven't checked that explicitly, but it's a tiny data
> set (15MB) and I observed this even on long benchmarks with tens of
> millions of queries. So the hint bits should have been set.
>
> Also, I should have mentioned the query does an index-only scan, and the
> pin/unpin calls are on index pages, not on the heap.

We don't actually need to call BufferGetLSNAtomic() from _bt_readpage during index-only scans (nor during bitmap index scans). We can just not call BufferGetLSNAtomic() at all (except during plain index scans), with no possible downside.

In general, we call BufferGetLSNAtomic() to stash the page LSN within so->currPos.lsn, for later. so->currPos.lsn provides us with a way to detect whether the page was modified during the period in which we dropped our pin on the leaf page -- plain index scans cannot set LP_DEAD bits on dead index tuples within _bt_killitems() if the page has changed. But index-only scans never drop the pin on the leaf page to begin with, and so don't even use so->currPos.lsn (bitmap index scans *never* call _bt_killitems(), and so obviously have no possible use for so->currPos.lsn either).

--
Peter Geoghegan
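To make the role of so->currPos.lsn concrete, here is a minimal sketch of the recheck it enables. This is illustrative only, not the actual _bt_killitems() code; rel and so are assumed to be the index relation and the scan's BTScanOpaque.

```c
/*
 * Sketch: recheck the leaf page's LSN after re-acquiring a pin that was
 * previously dropped.  If the LSN moved, the page was modified while
 * unpinned (possibly by VACUUM), so setting LP_DEAD bits is unsafe.
 */
Buffer      buf = _bt_getbuf(rel, so->currPos.currPage, BT_READ);

if (BufferGetLSNAtomic(buf) != so->currPos.lsn)
{
    /* Page changed since _bt_readpage saved its LSN: give up */
    _bt_relbuf(rel, buf);
    return;
}

/* ... otherwise it is safe to mark known-dead index tuples LP_DEAD ... */
_bt_relbuf(rel, buf);
```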
On Mon, May 19, 2025 at 2:01 PM Tomas Vondra <tomas@vondra.me> wrote:
> For index-only scans, yes.

Great.

> The regular index scans however still have this issue, although it's not
> as visible as for IOS.

We can do somewhat better with plain index scans than my initial v1 prototype, without any major difficulties. There's more low-hanging fruit.

We could also move the call to BufferGetLSNAtomic (that currently takes place inside _bt_readpage) over to _bt_drop_lock_and_maybe_pin. That way we'd only need to call BufferGetLSNAtomic for those leaf pages that will actually need to have some index tuples returned to the scan (through the btgettuple interface). In other words, we only need to call BufferGetLSNAtomic for pages that _bt_readpage returns "true" for when called. There are plenty of leaf pages that _bt_readpage will return "false" for, especially during large range scans and skip scans.

It's easy to see why this extension to my v1 POC is correct: the whole point of dropping the leaf page pin is that we don't block VACUUM when btgettuple returns -- but btgettuple isn't going to return until the next call to _bt_readpage that returns "true" actually takes place (or until the whole scan ends).

--
Peter Geoghegan
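A rough sketch of the restructuring described above: the LSN is captured only at the point where the pin is actually about to be dropped. The so->dropPin flag used here is the per-scan flag discussed later in the thread; the helper body below is a sketch, not the patch itself.

```c
/*
 * Sketch: release the lock on so->currPos.buf, and drop the pin too when
 * the scan is allowed to.  The page LSN is read only in the pin-dropping
 * case, since that is the only case where _bt_killitems will need it.
 */
static void
drop_lock_and_maybe_pin_sketch(Relation rel, BTScanOpaque so)
{
    if (!so->dropPin)
    {
        /* Keep the pin: just release the buffer lock */
        _bt_unlockbuf(rel, so->currPos.buf);
        return;
    }

    /* Save the LSN so that a later _bt_killitems can detect changes */
    so->currPos.lsn = BufferGetLSNAtomic(so->currPos.buf);

    /* Drop both the lock and the pin */
    _bt_relbuf(rel, so->currPos.buf);
    so->currPos.buf = InvalidBuffer;
}
```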
On Mon, May 19, 2025 at 3:37 PM Andres Freund <andres@anarazel.de> wrote:
> I think we can do better - something like
>
> #ifdef PG_HAVE_8BYTE_SINGLE_COPY_ATOMICITY
>     lsn = PageGetLSN(page);
> #else
>     buf_state = LockBufHdr(bufHdr);
>     lsn = PageGetLSN(page);
>     UnlockBufHdr(bufHdr, buf_state);
> #endif
>
> All perf relevant systems support reading 8 bytes without a chance of
> tearing...

Even assuming that this scheme works perfectly, I'd still like to pursue the idea I had about fixing this in nbtree.

The relevant nbtree code seems more elegant if we avoid calling BufferGetLSNAtomic() unless and until its return value might actually be useful. It's quite a lot easier to understand the true purpose of so->currPos.lsn with this new structure.

--
Peter Geoghegan
On Mon, May 19, 2025 at 4:17 PM Tomas Vondra <tomas@vondra.me> wrote:
> Same effect as v1 for IOS, with regular index scans I see this:
>
> 64 clients: 0.7M tps
> 96 clients: 1.5M tps
>
> So very similar improvement as for IOS.

Just to be clear, you're saying my v2 appears to fix the problem fully, for both plain index scans and index-only scans? (You haven't mentioned bitmap index scans at all, I think, even though they are relevant to v2.)

I'd be surprised if my v2 *fully* fixed the issue with plain index scans. It's only going to avoid calls to BufferGetLSNAtomic() following _bt_readpage calls that return false, but I doubt that your particular pgbench-variant test queries really look like that.

*Looks again* Oh, wait. This is another one of those benchmarks with an index scan that returns no rows whatsoever (just like on the other recent thread involving regressions tied to the introduction of a new skip support function to nbtree). It fully makes sense that my v2 would fix that particular problem, even with plain index scans. But...what about slightly different queries, that actually do return some rows? Those will look different.

--
Peter Geoghegan
On 5/19/25 22:29, Peter Geoghegan wrote:
> On Mon, May 19, 2025 at 4:17 PM Tomas Vondra <tomas@vondra.me> wrote:
>> Same effect as v1 for IOS, with regular index scans I see this:
>>
>> 64 clients: 0.7M tps
>> 96 clients: 1.5M tps
>>
>> So very similar improvement as for IOS.
>
> Just to be clear, you're saying my v2 appears to fix the problem
> fully, for both plain index scans and index-only scans? (You haven't
> mentioned bitmap index scans at all, I think, even though they are
> relevant to v2.)

With v2 the regression disappears, both for index-only scans and regular index scans. I haven't tried anything with bitmap scans - I hit the regression mostly by accident, on a workload that does IOS only. I may try constructing something with bitmap scans, but I didn't have time for that right now.

> I'd be surprised if my v2 *fully* fixed the issue with plain index
> scans. It's only going to avoid calls to BufferGetLSNAtomic()
> following _bt_readpage calls that return false, but I doubt that your
> particular pgbench-variant test queries really look like that.

Not sure. My workload is pretty simple, and similar to what I used in the other nbtree regression (with malloc overhead), except that it's not using partitioning.

  pgbench -i -s 1 test

  psql test <<EOF
  update pgbench_accounts set bid = aid;
  create index on pgbench_accounts(bid);
  vacuum full;
  analyze;
  EOF

  # select.sql
  set enable_indexonlyscan = off;
  select count(*) from pgbench_accounts where bid = 1

I don't know what "fully fix" means in this context. I see a massive improvement with v2, I have no idea if that's the best we could do.

> *Looks again* Oh, wait. This is another one of those benchmarks with
> an index scan that returns no rows whatsoever (just like on the other
> recent thread involving regressions tied to the introduction of a new
> skip support function to nbtree). Fully makes sense that my v2
> would fix that particular problem, even with plain index scans.
> But...what about slightly different queries, that actually do return
> some rows? Those will look different.

Actually, this particular query does return rows (one, to be precise). But you're right - it seems sensitive to how many rows are returned, and at some point the contention goes away and there's no regression.

I need to do proper automated testing, to get reliable results. I've been doing manual testing, but it's easy to make mistakes that way.

Do you have any suggestions what cases you'd like me to test?

regards

--
Tomas Vondra
On Mon, May 19, 2025 at 8:21 PM Tomas Vondra <tomas@vondra.me> wrote:
> With v2 the regression disappears, both for index-only scans and regular
> index scans. I haven't tried anything with bitmap scans - I hit the
> regression mostly by accident, on a workload that does IOS only. I may
> try constructing something with bitmap scans, but I didn't have time for
> that right now.

I imagine bitmap index scans will be similar to plain index scans.

> I don't know what "fully fix" means in this context. I see a massive
> improvement with v2, I have no idea if that's the best we could do.

You expected there to be *zero* performance impact from enabling checksums for this workload, since it is a pure read-only workload. That's what I meant by "fully fix".

> But you're right - it seems sensitive to how many rows are returned, and
> at some point the contention goes away and there's no regression.
>
> I need to do proper automated testing, to get reliable results. I've
> been doing manual testing, but it's easy to make mistakes that way.
>
> Do you have any suggestions what cases you'd like me to test?

Nothing comes to mind. Again, just be aware that completely avoiding the call to BufferGetLSNAtomic is only possible when:

* Scan is an index-only scan, OR
* Scan is a bitmap index scan, OR
* Scan is a plain index scan, reading a page that _bt_readpage() returned "false" for when called.

In other words, plain index scans that read a lot of tuples might receive no benefit whatsoever. It's possible that it already matters less there anyway. It's also possible that there is some unforeseen benefit from merely *delaying* the call to BufferGetLSNAtomic. But in all likelihood these "unfixed" plain index scans are just as fast with v2 as they are when run on master/baseline.

--
Peter Geoghegan
On Tue, May 20, 2025 at 8:43 AM Peter Geoghegan <pg@bowt.ie> wrote:
> I imagine bitmap index scans will be similar to plain index scans.

To be clear, I meant that the magnitude of the problem will be similar. I do not mean that they aren't fixed by my v2. They should be fully fixed by v2.

--
Peter Geoghegan
Hi,

On 2025-05-19 15:45:01 -0400, Peter Geoghegan wrote:
> On Mon, May 19, 2025 at 3:37 PM Andres Freund <andres@anarazel.de> wrote:
> > I think we can do better - something like
> >
> > #ifdef PG_HAVE_8BYTE_SINGLE_COPY_ATOMICITY
> >     lsn = PageGetLSN(page);
> > #else
> >     buf_state = LockBufHdr(bufHdr);
> >     lsn = PageGetLSN(page);
> >     UnlockBufHdr(bufHdr, buf_state);
> > #endif
> >
> > All perf relevant systems support reading 8 bytes without a chance of
> > tearing...
>
> Even assuming that this scheme works perfectly, I'd still like to
> pursue the idea I had about fixing this in nbtree.
>
> The relevant nbtree code seems more elegant if we avoid calling
> BufferGetLSNAtomic() unless and until its return value might actually
> be useful. It's quite a lot easier to understand the true purpose of
> so->currPos.lsn with this new structure.

I'm not against that - ISTM we should do both.

Greetings,

Andres Freund
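For concreteness, a sketch of what folding that #ifdef into BufferGetLSNAtomic() might look like, based on the function's general shape; treat the details as an assumption rather than a proposal from this thread.

```c
/*
 * Sketch: BufferGetLSNAtomic() with the 8-byte single-copy-atomicity fast
 * path suggested above.  On platforms that can read an 8-byte value
 * without tearing, the buffer header spinlock is unnecessary.
 */
XLogRecPtr
BufferGetLSNAtomic(Buffer buffer)
{
    char       *page = BufferGetPage(buffer);
    XLogRecPtr  lsn;

    /*
     * If hint bits aren't WAL-logged (no data checksums, no wal_log_hints)
     * or the buffer is backend-local, no locking was ever needed.
     */
    if (!XLogHintBitIsNeeded() || BufferIsLocal(buffer))
        return PageGetLSN(page);

    Assert(BufferIsValid(buffer));
    Assert(BufferIsPinned(buffer));

#ifdef PG_HAVE_8BYTE_SINGLE_COPY_ATOMICITY
    lsn = PageGetLSN(page);
#else
    {
        BufferDesc *bufHdr = GetBufferDescriptor(buffer - 1);
        uint32      buf_state;

        buf_state = LockBufHdr(bufHdr);
        lsn = PageGetLSN(page);
        UnlockBufHdr(bufHdr, buf_state);
    }
#endif

    return lsn;
}
```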
Hi,

We had an RMT meeting yesterday, and we briefly discussed whether the v2 patch should go into PG18. We concluded we probably should try to do that.

On the one hand, it's not a PG18-specific bug, in the sense that older releases are affected the same way. But it requires data checksums to be enabled, and a lot of systems run with checksums=off, simply because that's the default. But 18 will (likely) change that.

We didn't get any complaints about this (at least I'm not aware of any), but I suppose that's simply because people didn't have a comparison, and treated it as "normal". But with the changed default, it'll be easier to spot once they upgrade to PG18.

So better to get this in now, otherwise we may have to wait until PG19, because of ABI (the patch adds a field into BTScanPosData, but maybe it'd be possible to add it into padding, not sure).

regards

--
Tomas Vondra
On Wed, Jun 4, 2025 at 7:33 AM Tomas Vondra <tomas@vondra.me> wrote:
> So better to get this in now, otherwise we may have to wait until PG19,
> because of ABI (the patch adds a field into BTScanPosData, but maybe
> it'd be possible to add it into padding, not sure).

I agree. I can get this into shape for commit today.

Does anybody object to my proceeding with committing the patch on the master branch/putting it in Postgres 18? (FWIW I could probably fit the new field into some BTScanPosData alignment padding, but I don't favor back patching.)

I consider my patch to be low risk. There's a kind of symmetry to how things work with the patch in place, which IMV makes things simpler.

--
Peter Geoghegan
On Wed, Jun 4, 2025 at 10:21 AM Peter Geoghegan <pg@bowt.ie> wrote:
> On Wed, Jun 4, 2025 at 7:33 AM Tomas Vondra <tomas@vondra.me> wrote:
>> So better to get this in now, otherwise we may have to wait until PG19,
>> because of ABI (the patch adds a field into BTScanPosData, but maybe
>> it'd be possible to add it into padding, not sure).
>
> I agree. I can get this into shape for commit today.

Attached is v3, which is functionally the same as v2. It has improved comments, and a couple of extra assertions. Plus there's a real commit message now.

I also relocated the code that sets so.drop_pin from _bt_preprocess_keys to btrescan. That seems like the more logical place for it.

My current plan is to commit this in the next couple of days.

--
Peter Geoghegan
Attachment
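For readers following along, the decision being moved into btrescan looks roughly like the assignment below. The exact set of conditions is the patch's; the ones shown are a plausible sketch (using the so->dropPin spelling from later messages), not a quotation.

```c
/*
 * Sketch (in btrescan): decide once per rescan whether leaf-page pins can
 * be dropped eagerly.  The individual conditions are illustrative.
 */
so->dropPin = (!scan->xs_want_itup &&           /* not an index-only scan */
               IsMVCCSnapshot(scan->xs_snapshot) &&
               RelationNeedsWAL(scan->indexRelation) &&
               scan->heapRelation != NULL);      /* not a bitmap index scan */
```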
On Wed, Jun 4, 2025 at 1:39 PM Peter Geoghegan <pg@bowt.ie> wrote:
> My current plan is to commit this in the next couple of days.

Pushed.

It would be great if we could also teach BufferGetLSNAtomic() to just call PageGetLSN() (at least on PG_HAVE_8BYTE_SINGLE_COPY_ATOMICITY platforms), but my sense is that that's not going to happen in Postgres 18.

--
Peter Geoghegan
Hello Peter,
06.06.2025 19:33, Peter Geoghegan wrote:
> On Wed, Jun 4, 2025 at 1:39 PM Peter Geoghegan <pg@bowt.ie> wrote:
>> My current plan is to commit this in the next couple of days.
>
> Pushed.
Please look at the following script, which falsifies an assertion
introduced with e6eed40e4:
create user u;
grant all on schema public to u;
set session authorization u;
create table tbl1 (a int);
create function f1() returns int language plpgsql as $$ invalid $$;
select sum(1) over (partition by 1 order by objid)
from pg_shdepend
left join pg_stat_sys_indexes on refclassid = relid
left join tbl1 on true
limit 1;
(I've simplified an assert-triggering query generated by SQLsmith.)
TRAP: failed Assert("!BTScanPosIsPinned(so->currPos)"), File: "nbtutils.c", Line: 3379, PID: 1621028
ExceptionalCondition at assert.c:52:13
_bt_killitems at nbtutils.c:3380:3
_bt_steppage at nbtsearch.c:2134:5
_bt_next at nbtsearch.c:1560:7
btgettuple at nbtree.c:276:6
index_getnext_tid at indexam.c:640:25
index_getnext_slot at indexam.c:729:10
IndexNext at nodeIndexscan.c:131:9
ExecScanFetch at execScan.h:126:10
...
Best regards,
Alexander
Hello, Peter!

> I also relocated the code that sets so.drop_pin from
> _bt_preprocess_keys to btrescan. That seems like the more logical
> place for it.

I was rebasing [0] and noticed dropPin is not initialized in btbeginscan, which seems suspicious to me. Is that intended?

[0]: https://commitfest.postgresql.org/patch/5151/
Hi Alexander,

On Mon, Jun 9, 2025 at 2:00 AM Alexander Lakhin <exclusion@gmail.com> wrote:
> Please look at the following script, which falsifies an assertion
> TRAP: failed Assert("!BTScanPosIsPinned(so->currPos)"), File: "nbtutils.c", Line: 3379, PID: 1621028

I can reproduce this. I think that it's an old bug.

The problem is related to mark/restore (used by merge joins), where nbtree can sometimes independently hold a second pin on leaf pages (albeit very briefly, inside _bt_steppage). The cause of the failure of my recently-added assertion is that it assumes that so->dropPin always implies !BTScanPosIsPinned(so->currPos) at the start of _bt_killitems.

I don't think that that assumption is unreasonable, even in light of this assertion problem. I think that the real problem is the way that _bt_killitems releases resources. _bt_killitems doesn't try to leave so->currPos as it was at the start of its call -- this is nothing new. The overall effect is that during so->dropPin _bt_steppage calls, we *generally* don't call IncrBufferRefCount in the "bump pin on current buffer for assignment to mark buffer" path, but *will* call IncrBufferRefCount when we happened to need to call _bt_killitems. Calling _bt_killitems shouldn't have these side-effects.

I can make the assertion failure go away by teaching _bt_killitems to leave so->currPos in the same state that it was in when _bt_killitems was called. If there was no pin on the page when _bt_killitems was called, then there should be no pin held when it returns. See the attached patch.

As I said, I think that this is actually an old bug. After all, _bt_killitems is required to test so->currPos.lsn against the same page's current LSN if the page pin was dropped since _bt_readpage was first called -- regardless of any other details. I don't think that it'll consistently do that in this hard-to-test mark/restore path on any Postgres version, so there might be a risk of failing to detect an unsafe concurrent TID recycling hazard.

I've always thought that the whole idiom of testing BTScanPosIsPinned() in a bunch of places was very brittle. I wonder if it makes sense to totally replace those calls/tests with similar tests of the new so->dropPin field.

--
Peter Geoghegan
Attachment
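The fix described above amounts to making _bt_killitems pin-neutral: its pin state on exit matches its pin state on entry. A simplified sketch of that shape follows (illustrative helper, not the attached patch).

```c
/*
 * Sketch: _bt_killitems leaves so->currPos's pin exactly as it found it.
 * If it has to re-acquire a pin (because the scan dropped it earlier), it
 * releases that pin again before returning.
 */
static void
kill_items_pin_neutral_sketch(IndexScanDesc scan)
{
    BTScanOpaque so = (BTScanOpaque) scan->opaque;
    Relation    rel = scan->indexRelation;
    bool        pinned_on_entry = BTScanPosIsPinned(so->currPos);
    Buffer      buf;

    if (pinned_on_entry)
    {
        /* Pin never dropped: the page cannot have changed under us */
        buf = so->currPos.buf;
        _bt_lockbuf(rel, buf, BT_READ);
    }
    else
    {
        /* Pin was dropped: re-pin and verify the page is unchanged */
        buf = _bt_getbuf(rel, so->currPos.currPage, BT_READ);
        if (BufferGetLSNAtomic(buf) != so->currPos.lsn)
        {
            _bt_relbuf(rel, buf);
            return;             /* page changed; don't set LP_DEAD bits */
        }
    }

    /* ... set LP_DEAD bits on known-dead index tuples here ... */

    if (pinned_on_entry)
        _bt_unlockbuf(rel, buf);    /* keep the caller's pin */
    else
        _bt_relbuf(rel, buf);       /* drop the pin we acquired */
}
```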
On Mon, Jun 9, 2025 at 8:48 AM Mihail Nikalayeu <mihailnikalayeu@gmail.com> wrote:
> I was rebasing [0] and noticed dropPin is not initialized in
> btbeginscan, which seems suspicious to me.
> Is that intended?

That's pretty normal. We don't have access to the scan descriptor within btbeginscan. This is also why we do things like allocate so->currTuples within btrescan. We don't yet know if the scan will be an index-only scan when btbeginscan is called.

--
Peter Geoghegan
On Mon, Jun 9, 2025 at 10:47 AM Peter Geoghegan <pg@bowt.ie> wrote:
> I've always thought that the whole idiom of testing
> BTScanPosIsPinned() in a bunch of places was very brittle. I wonder if
> it makes sense to totally replace those calls/tests with similar tests
> of the new so->dropPin field.

It's definitely possible to avoid testing BTScanPosIsPinned like this entirely. The attached v2 patch shows how. v2 fully removes BTScanPosUnpinIfPinned, and limits the use of BTScanPosIsPinned to assertions.

I think that this approach makes our buffer pin resource management strategy much clearer, particularly in cases involving mark/restore. This might help with Tomas Vondra's index prefetching patch -- I know that Tomas found the mark/restore aspects of his latest approach challenging.

In v2, I added an assertion at the top of _bt_steppage that independently verifies the same conditions as the ones that are already tested whenever _bt_steppage calls _bt_killitems (these include the assertion that Alexander showed could fail):

static bool
_bt_steppage(IndexScanDesc scan, ScanDirection dir)
{
    BTScanOpaque so = (BTScanOpaque) scan->opaque;
    BlockNumber blkno,
                lastcurrblkno;

    Assert(BTScanPosIsValid(so->currPos));
    if (!so->dropPin)
        Assert(BTScanPosIsPinned(so->currPos));
    else
        Assert(!BTScanPosIsPinned(so->currPos));
    ...

That way, if there are remaining issues like the ones that Alexander reported, we don't rely on reaching _bt_killitems to get an assertion failure.

(Note that adding these assertions necessitated making _bt_firstpage call _bt_readnextpage instead of its _bt_steppage wrapper. _bt_steppage is now "Only called when _bt_drop_lock_and_maybe_pin was called for so->currPos", which doesn't apply to the case where _bt_firstpage previously called _bt_steppage. This brings _bt_firstpage in line with _bt_readnextpage itself -- now we only call _bt_steppage for a page that we returned to the scan via btgettuple/a page that might have had its pin released by _bt_drop_lock_and_maybe_pin.)

v2 also completely removes all IncrBufferRefCount() calls from nbtree. These were never really necessary, since we don't actually need to hold more than one pin at a time on the same page at any point in cases involving mark/restore (nor any other case). We can always get by with no more than one buffer pin on the same page. There were 2 IncrBufferRefCount() calls total, both of which v2 removes:

The first IncrBufferRefCount call removed was in _bt_steppage. We copy so->currPos into so->markPos when leaving the page in _bt_steppage. We're invalidating so->currPos in passing here, so we don't *need* to play games with IncrBufferRefCount -- we can just not release the original pin, based on the understanding that the pin is "transferred" to so->markPos (so->currPos can therefore be invalidated without having its pin unpinned).

The second IncrBufferRefCount call removed was in btrestrpos. When we restore a mark, we need to keep around the original mark, in case it needs to be restored a little later -- the so->markPos mark must not go away. But that in itself doesn't necessitate holding multiple buffer pins on the same page at once. We can keep the old mark around, without keeping a separate buffer pin for so->markPos. We can just set so->markItemIndex instead -- that's an alternative (and more efficient) way of representing the mark (the so->markPos representation isn't truly needed). This is okay, since our new so->currPos is for the same page as the mark -- that's how so->markItemIndex is supposed to be used.

--
Peter Geoghegan
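To illustrate the "transfer the pin" idea from the first point above, here is a sketch of what the mark-saving path in _bt_steppage can look like without IncrBufferRefCount. The field names are real BTScanOpaque fields, but the fragment is a sketch of the technique rather than the patch's exact code.

```c
/*
 * Sketch: when a mark is pending as we leave the current leaf page, hand
 * the existing pin over to so->markPos instead of taking a second pin on
 * the same buffer with IncrBufferRefCount().
 */
if (so->markItemIndex >= 0)
{
    /* Copy the whole saved position, including the buffer reference */
    memcpy(&so->markPos, &so->currPos, sizeof(BTScanPosData));
    so->markPos.itemIndex = so->markItemIndex;
    so->markItemIndex = -1;

    /*
     * Exactly one pin remains, and it now belongs to so->markPos.
     * so->currPos no longer owns it, so invalidate its buffer reference
     * rather than calling ReleaseBuffer().
     */
    so->currPos.buf = InvalidBuffer;
}
```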