Thread: Re: strange perf regression with data checksums
Hi Tomas,

> While running some benchmarks comparing 17 and 18, I ran into a simple
> workload where 18 throughput drops by ~80%. After pulling my hair for a
> couple hours I realized the change that triggered this is 04bec894a04c,
> which set checksums on by default. Which is very bizarre, because the
> workload is read-only and fits into shared buffers.
>
> [...]
>
> But why would it depend on checksums at all? This read-only test should
> be entirely in-memory, so how come it's affected?

These are interesting results.

Just wanted to clarify: did you make sure that all the hint bits were
set before executing the benchmark?

I'm not claiming that hint bits are necessarily the reason for the
observed behavior, but when something is off with presumably read-only
queries, this is the first reason that comes to mind. At least we
should make sure hint bits are excluded from the equation. If memory
serves, VACUUM FULL and CHECKPOINT after filling the table and
creating the index should do the trick.

--
Best regards,
Aleksander Alekseev
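For concreteness, that preparation step amounts to something like this (a sketch; the table name is taken from the reproducer posted later in the thread):

```sql
-- Rewrite the table so every tuple's hint bits are set, then flush all
-- dirty buffers, so the benchmark itself never needs to dirty a page.
VACUUM FULL pgbench_accounts;
CHECKPOINT;
```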
On 5/9/25 14:53, Aleksander Alekseev wrote:
> Hi Tomas,
>
>> While running some benchmarks comparing 17 and 18, I ran into a simple
>> workload where 18 throughput drops by ~80%. After pulling my hair for a
>> couple hours I realized the change that triggered this is 04bec894a04c,
>> which set checksums on by default. Which is very bizarre, because the
>> workload is read-only and fits into shared buffers.
>>
>> [...]
>>
>> But why would it depend on checksums at all? This read-only test should
>> be entirely in-memory, so how come it's affected?
>
> These are interesting results.
>
> Just wanted to clarify: did you make sure that all the hint bits were
> set before executing the benchmark?
>
> I'm not claiming that hint bits are necessarily the reason for the
> observed behavior, but when something is off with presumably read-only
> queries, this is the first reason that comes to mind. At least we
> should make sure hint bits are excluded from the equation. If memory
> serves, VACUUM FULL and CHECKPOINT after filling the table and
> creating the index should do the trick.

Good question. I haven't checked that explicitly, but it's a tiny data
set (15MB) and I observed this even on long benchmarks with tens of
millions of queries. So the hint bits should have been set.

Also, I should have mentioned the query does an index-only scan, and
the pin/unpin calls are on index pages, not on the heap.

regards

--
Tomas Vondra
Hi,

> > I'm not claiming that hint bits are necessarily the reason for the
> > observed behavior, but when something is off with presumably read-only
> > queries, this is the first reason that comes to mind. At least we
> > should make sure hint bits are excluded from the equation. If memory
> > serves, VACUUM FULL and CHECKPOINT after filling the table and
> > creating the index should do the trick.
>
> Good question. I haven't checked that explicitly, but it's a tiny data
> set (15MB) and I observed this even on long benchmarks with tens of
> millions of queries. So the hint bits should have been set.
>
> Also, I should have mentioned the query does an index-only scan, and
> the pin/unpin calls are on index pages, not on the heap.

There is one more thing I would check. As I recall, perf shows only
on-CPU time, while in reality the backends may be sleeping on locks
most of the time. If that's the case, perf will not show you an
accurate picture.

To check this myself, I create a gdb.script file with a single GDB
command:

```
bt
```

And execute:

```
gdb --batch --command=gdb.script -p (backend_pid_here)
```

... 10+ times or so. If what you are observing is actually lock
contention and the backend sleeps on a lock most of the time, 8 out of
10 or so of the stacktraces will show you this.

I assume, of course, that the benchmark is done on release builds with
Asserts disabled, etc.

BTW, do you believe this is a problem related exclusively to NUMA CPUs
with 90+ cores, or can I reproduce it on SMT as well?

--
Best regards,
Aleksander Alekseev
On Fri, May 9, 2025 at 9:06 AM Tomas Vondra <tomas@vondra.me> wrote:
> Good question. I haven't checked that explicitly, but it's a tiny data
> set (15MB) and I observed this even on long benchmarks with tens of
> millions of queries. So the hint bits should have been set.
>
> Also, I should have mentioned the query does an index-only scan, and
> the pin/unpin calls are on index pages, not on the heap.

We don't actually need to call BufferGetLSNAtomic() from _bt_readpage
during index-only scans (nor during bitmap index scans). We can just
not call BufferGetLSNAtomic() at all (except during plain index
scans), with no possible downside.

In general, we call BufferGetLSNAtomic() to stash the page LSN within
so->currPos.lsn, for later. so->currPos.lsn provides us with a way to
detect whether the page was modified during the period in which we
dropped our pin on the leaf page -- plain index scans cannot set
LP_DEAD bits on dead index tuples within _bt_killitems() if the page
has changed. But index-only scans never drop the pin on the leaf page
to begin with, and so don't even use so->currPos.lsn (bitmap index
scans *never* call _bt_killitems(), and so obviously have no possible
use for so->currPos.lsn, either).

--
Peter Geoghegan
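For illustration, the fix amounts to making the LSN read in _bt_readpage conditional. A minimal sketch, not the actual patch; `needLSN` is a stand-in for however the scan type ends up being detected:

```c
/*
 * Sketch: only plain index scans ever drop the leaf page pin and later
 * consult so->currPos.lsn in _bt_killitems(), so only they need to pay
 * for reading the LSN (which takes the buffer header spinlock when
 * checksums are enabled).
 */
if (needLSN)
    so->currPos.lsn = BufferGetLSNAtomic(so->currPos.buf);
else
    so->currPos.lsn = InvalidXLogRecPtr;	/* never examined */
```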
On Mon, May 19, 2025 at 2:01 PM Tomas Vondra <tomas@vondra.me> wrote:
> For index-only scans, yes.

Great.

> The regular index scans, however, still have this issue, although
> it's not as visible as for IOS.

We can do somewhat better with plain index scans than my initial v1
prototype, without any major difficulties. There's more low-hanging
fruit.

We could also move the call to BufferGetLSNAtomic (that currently
takes place inside _bt_readpage) over to _bt_drop_lock_and_maybe_pin.
That way we'd only need to call BufferGetLSNAtomic for those leaf
pages that will actually need to have some index tuples returned to
the scan (through the btgettuple interface). In other words, we only
need to call BufferGetLSNAtomic for pages that _bt_readpage returns
"true" for when called. There are plenty of leaf pages that
_bt_readpage will return "false" for, especially during large range
scans, and skip scans.

It's easy to see why this extension to my v1 POC is correct: the whole
point of dropping the leaf page pin is that we don't block VACUUM when
btgettuple returns -- but btgettuple isn't going to return until the
next call to _bt_readpage that returns "true" actually takes place (or
until the whole scan ends).

--
Peter Geoghegan
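As a rough illustration of that refinement, based on the PG17 shape of the helper (a sketch, not the committed change):

```c
/*
 * Sketch: read the page LSN only at the point where the pin really is
 * about to be dropped, rather than unconditionally in _bt_readpage().
 * The if-test is the helper's existing "safe and useful to drop the
 * pin" condition.
 */
static void
_bt_drop_lock_and_maybe_pin(IndexScanDesc scan, BTScanPos sp)
{
    _bt_unlockbuf(scan->indexRelation, sp->buf);

    if (IsMVCCSnapshot(scan->xs_snapshot) &&
        RelationNeedsWAL(scan->indexRelation) &&
        !scan->xs_want_itup)
    {
        /* remember the LSN so _bt_killitems() can detect changes */
        sp->lsn = BufferGetLSNAtomic(sp->buf);
        ReleaseBuffer(sp->buf);
        sp->buf = InvalidBuffer;
    }
}
```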
On Mon, May 19, 2025 at 3:37 PM Andres Freund <andres@anarazel.de> wrote:
> I think we can do better - something like
>
> #ifdef PG_HAVE_8BYTE_SINGLE_COPY_ATOMICITY
>     lsn = PageGetLSN(page);
> #else
>     buf_state = LockBufHdr(bufHdr);
>     lsn = PageGetLSN(page);
>     UnlockBufHdr(bufHdr, buf_state);
> #endif
>
> All perf relevant systems support reading 8 bytes without a chance of
> tearing...

Even assuming that this scheme works perfectly, I'd still like to
pursue the idea I had about fixing this in nbtree.

The relevant nbtree code seems more elegant if we avoid calling
BufferGetLSNAtomic() unless and until its return value might actually
be useful. It's quite a lot easier to understand the true purpose of
so->currPos.lsn with this new structure.

--
Peter Geoghegan
On Mon, May 19, 2025 at 4:17 PM Tomas Vondra <tomas@vondra.me> wrote:
> Same effect as v1 for IOS, with regular index scans I see this:
>
> 64 clients: 0.7M tps
> 96 clients: 1.5M tps
>
> So very similar improvement as for IOS.

Just to be clear, you're saying my v2 appears to fix the problem
fully, for both plain index scans and index-only scans? (You haven't
mentioned bitmap index scans at all, I think, even though they are
relevant to v2.)

I'd be surprised if my v2 *fully* fixed the issue with plain index
scans. It's only going to avoid calls to BufferGetLSNAtomic()
following _bt_readpage calls that return false, but I doubt that your
particular pgbench-variant test queries really look like that.

*Looks again* Oh, wait. This is another one of those benchmarks with
an index scan that returns no rows whatsoever (just like on the other
recent thread involving regressions tied to the introduction of a new
"skip support" support function in nbtree). It fully makes sense that
my v2 would fix that particular problem, even with plain index scans.
But... what about slightly different queries, that actually do return
some rows? Those will look different.

--
Peter Geoghegan
On 5/19/25 22:29, Peter Geoghegan wrote:
> On Mon, May 19, 2025 at 4:17 PM Tomas Vondra <tomas@vondra.me> wrote:
>> Same effect as v1 for IOS, with regular index scans I see this:
>>
>> 64 clients: 0.7M tps
>> 96 clients: 1.5M tps
>>
>> So very similar improvement as for IOS.
>
> Just to be clear, you're saying my v2 appears to fix the problem
> fully, for both plain index scans and index-only scans? (You haven't
> mentioned bitmap index scans at all, I think, even though they are
> relevant to v2.)

With v2 the regression disappears, both for index-only scans and
regular index scans. I haven't tried anything with bitmap scans - I
hit the regression mostly by accident, on a workload that does IOS
only. I may try constructing something with bitmap scans, but I didn't
have time for that right now.

> I'd be surprised if my v2 *fully* fixed the issue with plain index
> scans. It's only going to avoid calls to BufferGetLSNAtomic()
> following _bt_readpage calls that return false, but I doubt that your
> particular pgbench-variant test queries really look like that.

Not sure. My workload is pretty simple, and similar to what I used in
the other nbtree regression (with malloc overhead), except that it's
not using partitioning.

pgbench -i -s 1 test

psql test <<EOF
update pgbench_accounts set bid = aid;
create index on pgbench_accounts(bid);
vacuum full;
analyze;
EOF

# select.sql
set enable_indexonlyscan = off;
select count(*) from pgbench_accounts where bid = 1;

I don't know what "fully fix" means in this context. I see a massive
improvement with v2, I have no idea if that's the best we could do.

> *Looks again* Oh, wait. This is another one of those benchmarks with
> an index scan that returns no rows whatsoever (just like on the other
> recent thread involving regressions tied to the introduction of a new
> "skip support" support function in nbtree). It fully makes sense that
> my v2 would fix that particular problem, even with plain index scans.
> But... what about slightly different queries, that actually do return
> some rows? Those will look different.

Actually, this particular query does return rows (one, to be precise).
But you're right - it seems sensitive to how many rows are returned,
and at some point the contention goes away and there's no regression.

I need to do proper automated testing, to get reliable results. I've
been doing manual testing, but it's easy to make mistakes that way.

Do you have any suggestions what cases you'd like me to test?

regards

--
Tomas Vondra
On Mon, May 19, 2025 at 8:21 PM Tomas Vondra <tomas@vondra.me> wrote:
> With v2 the regression disappears, both for index-only scans and
> regular index scans. I haven't tried anything with bitmap scans - I
> hit the regression mostly by accident, on a workload that does IOS
> only. I may try constructing something with bitmap scans, but I
> didn't have time for that right now.

I imagine bitmap index scans will be similar to plain index scans.

> I don't know what "fully fix" means in this context. I see a massive
> improvement with v2, I have no idea if that's the best we could do.

You expected there to be *zero* performance impact from enabling
checksums for this workload, since it is a pure read-only workload.
That's what I meant by "fully fix".

> But you're right - it seems sensitive to how many rows are returned,
> and at some point the contention goes away and there's no regression.
>
> I need to do proper automated testing, to get reliable results. I've
> been doing manual testing, but it's easy to make mistakes that way.
>
> Do you have any suggestions what cases you'd like me to test?

Nothing comes to mind. Again, just be aware that completely avoiding
the call to BufferGetLSNAtomic() is only possible when:

* Scan is an index-only scan, OR
* Scan is a bitmap index scan, OR
* Scan is a plain index scan, reading a page that _bt_readpage()
  returned "false" for when called.

In other words, plain index scans that read a lot of tuples might
receive no benefit whatsoever. It's possible that it already matters
less there anyway. It's also possible that there is some unforeseen
benefit from merely *delaying* the call to BufferGetLSNAtomic. But in
all likelihood these "unfixed" plain index scans are just as fast with
v2 as they are when run on master/baseline.

--
Peter Geoghegan
On Tue, May 20, 2025 at 8:43 AM Peter Geoghegan <pg@bowt.ie> wrote:
> I imagine bitmap index scans will be similar to plain index scans.

To be clear, I meant that the magnitude of the problem will be
similar. I do not mean that they aren't fixed by my v2. They should be
fully fixed by v2.

--
Peter Geoghegan
Hi,

On 2025-05-19 15:45:01 -0400, Peter Geoghegan wrote:
> On Mon, May 19, 2025 at 3:37 PM Andres Freund <andres@anarazel.de> wrote:
> > I think we can do better - something like
> >
> > #ifdef PG_HAVE_8BYTE_SINGLE_COPY_ATOMICITY
> >     lsn = PageGetLSN(page);
> > #else
> >     buf_state = LockBufHdr(bufHdr);
> >     lsn = PageGetLSN(page);
> >     UnlockBufHdr(bufHdr, buf_state);
> > #endif
> >
> > All perf relevant systems support reading 8 bytes without a chance of
> > tearing...
>
> Even assuming that this scheme works perfectly, I'd still like to
> pursue the idea I had about fixing this in nbtree.
>
> The relevant nbtree code seems more elegant if we avoid calling
> BufferGetLSNAtomic() unless and until its return value might actually
> be useful. It's quite a lot easier to understand the true purpose of
> so->currPos.lsn with this new structure.

I'm not against that - ISTM we should do both.

Greetings,

Andres Freund
Hi,

We had a RMT meeting yesterday, and we briefly discussed whether the
v2 patch should go into PG18. And we concluded we probably should try
to do that.

On the one hand, it's not a PG18-specific bug, in the sense that older
releases are affected the same way. But it requires data checksums to
be enabled, and a lot of systems run with checksums=off, simply
because that's the default. But 18 will (likely) change that.

We didn't get any complaints about this (at least I'm not aware of
any), but I suppose that's simply because people didn't have a
comparison, and treated it as "normal". But with the default changing,
it'll be easier to spot once they upgrade to PG18.

So better to get this in now, otherwise we may have to wait until
PG19, because of ABI (the patch adds a field into BTScanPosData, but
maybe it'd be possible to add it into padding, not sure).

regards

--
Tomas Vondra
On Wed, Jun 4, 2025 at 7:33 AM Tomas Vondra <tomas@vondra.me> wrote:
> So better to get this in now, otherwise we may have to wait until
> PG19, because of ABI (the patch adds a field into BTScanPosData, but
> maybe it'd be possible to add it into padding, not sure).

I agree. I can get this into shape for commit today.

Does anybody object to my proceeding with committing the patch on the
master branch/putting it in Postgres 18? (FWIW I could probably fit
the new field into some BTScanPosData alignment padding, but I don't
favor back patching.)

I consider my patch to be low risk. There's a kind of symmetry to how
things work with the patch in place, which IMV makes things simpler.

--
Peter Geoghegan
On Wed, Jun 4, 2025 at 10:21 AM Peter Geoghegan <pg@bowt.ie> wrote:
> On Wed, Jun 4, 2025 at 7:33 AM Tomas Vondra <tomas@vondra.me> wrote:
> > So better to get this in now, otherwise we may have to wait until
> > PG19, because of ABI (the patch adds a field into BTScanPosData,
> > but maybe it'd be possible to add it into padding, not sure).
>
> I agree. I can get this into shape for commit today.

Attached is v3, which is functionally the same as v2. It has improved
comments, and a couple of extra assertions. Plus there's a real commit
message now.

I also relocated the code that sets so.drop_pin from
_bt_preprocess_keys to btrescan. That seems like the more logical
place for it.

My current plan is to commit this in the next couple of days.

--
Peter Geoghegan
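For context, the core of the patch is a flag computed once per (re)scan. A hedged sketch of the btrescan logic, with field and variable names only approximating the posted patch:

```c
/*
 * Sketch: decide up front whether this scan eagerly drops its leaf
 * page pins between btgettuple() calls -- and therefore must remember
 * leaf page LSNs for _bt_killitems().  Pins are retained (and no LSN
 * is needed) for index-only scans (xs_want_itup), non-MVCC snapshots,
 * unlogged relations, and bitmap index scans (no heapRelation).
 */
so->dropPin = (!scan->xs_want_itup &&
               IsMVCCSnapshot(scan->xs_snapshot) &&
               RelationNeedsWAL(scan->indexRelation) &&
               scan->heapRelation != NULL);
```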
On Wed, Jun 4, 2025 at 1:39 PM Peter Geoghegan <pg@bowt.ie> wrote:
> My current plan is to commit this in the next couple of days.

Pushed.

It would be great if we could also teach BufferGetLSNAtomic() to just
call PageGetLSN() (at least on PG_HAVE_8BYTE_SINGLE_COPY_ATOMICITY
platforms), but my sense is that that's not going to happen in
Postgres 18.

--
Peter Geoghegan
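Following Andres's snippet from upthread, such a change might look roughly like this -- a sketch only, ignoring the fast paths the real function also handles (local buffers, and the case where no hint-bit WAL-logging is needed):

```c
XLogRecPtr
BufferGetLSNAtomic(Buffer buffer)
{
    Page        page = BufferGetPage(buffer);

#ifdef PG_HAVE_8BYTE_SINGLE_COPY_ATOMICITY
    /* 8-byte loads cannot tear on these platforms; a plain read is safe */
    return PageGetLSN(page);
#else
    {
        BufferDesc *bufHdr = GetBufferDescriptor(buffer - 1);
        uint32      buf_state;
        XLogRecPtr  lsn;

        /* fall back to the buffer header spinlock to avoid torn reads */
        buf_state = LockBufHdr(bufHdr);
        lsn = PageGetLSN(page);
        UnlockBufHdr(bufHdr, buf_state);
        return lsn;
    }
#endif
}
```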