Thread: BTScanOpaqueData size slows down tests
Hi,

I was a bit annoyed at test times just now, so I ran a profile of the entire
regression tests in a cassert -Og build.  Unsurprisingly, most of the time is
spent in AllocSetCheck().  I was mildly surprised to see how expensive the new
compact attribute checks are.  What I was more surprised to realize is how
much of the time is spent freeing and allocating BTScanOpaqueData.

+    6.94%  postgres  postgres    [.] AllocSetCheck
-    4.96%  postgres  libc.so.6   [.] __memset_evex_unaligned_erms
   - 1.94% memset@plt
      - 1.12% _int_malloc
         - 1.11% malloc
            - 0.90% AllocSetAllocLarge
                 AllocSetAlloc
               - 0.77% palloc
                  - 0.63% btbeginscan
                       index_beginscan_internal
                     - 0.63% index_beginscan
                        - 0.61% systable_beginscan
                           + 0.22% SearchCatCacheMiss
                           + 0.07% ScanPgRelation
                           + 0.05% RelationBuildTupleDesc
                           + 0.04% findDependentObjects
                             0.03% GetNewOidWithIndex
                           + 0.02% deleteOneObject
                           + 0.02% shdepDropDependency
                           + 0.02% DeleteComments
                           + 0.02% SearchCatCacheList
                           + 0.02% DeleteSecurityLabel
                           + 0.02% DeleteInitPrivs
                  + 0.04% text_to_cstring
                  + 0.02% cstring_to_text_with_len
                  + 0.02% datumCopy
                  + 0.02% tuplesort_begin_batch
               + 0.11% palloc_extended
               + 0.01% AllocSetRealloc
            + 0.20% AllocSetAllocFromNewBlock
      + 0.82% _int_free_merge_chunk
   - 1.90% __memset_evex_unaligned_erms
      - 1.82% wipe_mem
         - 1.33% AllocSetFree
            - 1.33% pfree
               + 0.73% btendscan
               + 0.22% freedfa
               + 0.06% ExecAggCopyTransValue
               + 0.04% freenfa
               + 0.03% enlarge_list
               + 0.03% ExecDropSingleTupleTableSlot
               + 0.02% xmlconcat
               + 0.01% RemoveLocalLock
               + 0.01% errcontext_msg
               + 0.01% IndexScanEnd
                 0.01% heap_free_minimal_tuple
         + 0.49% AllocSetReset
        0.02% palloc0
        0.01% PageInit
      + 0.01% wipe_mem
   + 0.59% alloc_perturb
   + 0.46% asm_exc_page_fault
   + 0.03% asm_sysvec_apic_timer_interrupt
   + 0.02% wipe_mem

Looking at the size of BTScanOpaqueData I am less surprised:

        /* --- cacheline 1 boundary (64 bytes) --- */
        char *          currTuples;             /*    64     8 */
        char *          markTuples;             /*    72     8 */
        int             markItemIndex;          /*    80     4 */

        /* XXX 4 bytes hole, try to pack */

        BTScanPosData   currPos __attribute__((__aligned__(8))); /*    88 13632 */
        /* --- cacheline 214 boundary (13696 bytes) was 24 bytes ago --- */
        BTScanPosData   markPos __attribute__((__aligned__(8))); /* 13720 13632 */

        /* size: 27352, cachelines: 428, members: 17 */
        /* sum members: 27340, holes: 4, sum holes: 12 */
        /* forced alignments: 2, forced holes: 1, sum forced holes: 4 */
        /* last cacheline: 24 bytes */
} __attribute__((__aligned__(8)));

Allocating, zeroing and freeing 28kB of memory for every syscache miss, yea,
that's gonna hurt.

The reason BTScanPosData is that large is that it stores
MaxTIDsPerBTreePage * sizeof(BTScanPosItem) worth of items:

        BTScanPosItem   items[1358] __attribute__((__aligned__(2))); /*    48 13580 */

Could we perhaps allocate BTScanPosData->items dynamically if more than a
handful of items are needed?

And/or perhaps we could allocate BTScanOpaqueData.markPos as a whole only
when mark/restore are actually used?

I'd be rather unsurprised if this isn't just an issue for tests, but also in
a few real workloads.

Greetings,

Andres Freund

Andres Freund <andres@anarazel.de> writes:
> Looking at the size of BTScanOpaqueData I am less surprised:
>         /* size: 27352, cachelines: 428, members: 17 */
> allocating, zeroing and freeing 28kB of memory for every syscache miss, yea,
> that's gonna hurt.

Ouch!  I had no idea it had gotten that big.  Yeah, we ought to
do something about that.

> And/or perhaps we could allocate BTScanOpaqueData.markPos as a whole
> only when mark/restore are used?

That'd be an easy way of removing about half of the problem, but
14kB is still too much.  How badly do we need this items array?
Couldn't we just reference the on-page items?

			regards, tom lane

On Wed, Apr 2, 2025 at 11:36 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Ouch!  I had no idea it had gotten that big.  Yeah, we ought to
> do something about that.

Tomas Vondra talked about this recently, in the context of his work on
prefetching.

> > And/or perhaps we could allocate BTScanOpaqueData.markPos as a whole
> > only when mark/restore are used?
>
> That'd be an easy way of removing about half of the problem, but
> 14kB is still too much.  How badly do we need this items array?
> Couldn't we just reference the on-page items?

I'm not sure what you mean by that.  The whole design of _bt_readpage is
based on the idea that we read a whole page, in one go.  It has to batch up
the items that are to be returned from the page somewhere.

The worst case is that there are about 1350 TIDs to return from any single
page (assuming default BLCKSZ).  It's very pessimistic to start from the
assumption that that worst case will be hit, but I don't see a way around
doing it at least some of the time.

The first thing I'd try is some kind of simple dynamic allocation scheme,
with a small built-in array that avoided any allocation penalty in the
common case where there weren't too many tuples to return from the page.

The way that we allocate BLCKSZ twice for index-only scans (one for
so->currTuples, the other for so->markTuples) is also pretty inefficient.
Especially because any kind of use of mark and restore is exceedingly rare.

--
Peter Geoghegan
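
A self-contained toy sketch of the "defer the mark-side allocations" idea
mentioned above (made-up names, not nbtree code; BLOCK_SIZE stands in for
BLCKSZ and error handling is omitted):

#include <stdlib.h>

#define BLOCK_SIZE 8192          /* stands in for BLCKSZ */

typedef struct ScanState
{
    char *currTuples;            /* allocated when the scan starts */
    char *markTuples;            /* allocated lazily, on first mark */
} ScanState;

static void
scan_begin(ScanState *so)
{
    so->currTuples = malloc(BLOCK_SIZE);
    so->markTuples = NULL;       /* most scans never mark, so don't pay */
}

static void
scan_mark(ScanState *so)
{
    /* first mark taken by this scan: only now pay for the buffer */
    if (so->markTuples == NULL)
        so->markTuples = malloc(BLOCK_SIZE);

    /* ... save the current position into the mark state ... */
}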

Hi,

On 2025-04-02 11:36:33 -0400, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > Looking at the size of BTScanOpaqueData I am less surprised:
> >         /* size: 27352, cachelines: 428, members: 17 */
> > allocating, zeroing and freeing 28kB of memory for every syscache miss, yea,
> > that's gonna hurt.
>
> Ouch!  I had no idea it had gotten that big.  Yeah, we ought to
> do something about that.

It got a bit bigger a few years back, in

commit 0d861bbb702
Author: Peter Geoghegan <pg@bowt.ie>
Date:   2020-02-26 13:05:30 -0800

    Add deduplication to nbtree.

Because the posting list is a lot more dense, more items can be stored on
each page.  Not that it was small before either:

        BTScanPosData   currPos __attribute__((__aligned__(8))); /*    88  4128 */
        /* --- cacheline 65 boundary (4160 bytes) was 56 bytes ago --- */
        BTScanPosData   markPos __attribute__((__aligned__(8))); /*  4216  4128 */

        /* size: 8344, cachelines: 131, members: 16 */
        /* sum members: 8334, holes: 3, sum holes: 10 */
        /* forced alignments: 2, forced holes: 1, sum forced holes: 4 */
        /* last cacheline: 24 bytes */
} __attribute__((__aligned__(8)));

But obviously ~3.2x can qualitatively change something.

> > And/or perhaps we could allocate BTScanOpaqueData.markPos as a whole
> > only when mark/restore are used?
>
> That'd be an easy way of removing about half of the problem, but
> 14kB is still too much.  How badly do we need this items array?
> Couldn't we just reference the on-page items?

I think that'd require acquiring the buffer lock and/or pin more frequently.
But I know very little about nbtree.

I'd assume it's extremely rare for there to be this many items on a page.
I'd guess that something like having BTScanPosData->items point to an
in-line BTScanPosItem items_inline[N] (with N somewhere around 4-16), and
dynamically allocating a full-length BTScanPosItem[MaxTIDsPerBTreePage]
just in the cases it's needed, would do the trick.

I'm a bit confused by the "MUST BE LAST" comment:

        BTScanPosItem items[MaxTIDsPerBTreePage];       /* MUST BE LAST */

Not clear why?  Seems to be from rather long back:

commit 09cb5c0e7d6
Author: Tom Lane <tgl@sss.pgh.pa.us>
Date:   2006-05-07 01:21:30 +0000

    Rewrite btree index scans to work a page at a time in all cases (both

Greetings,

Andres Freund

On Wed, Apr 2, 2025 at 11:57 AM Andres Freund <andres@anarazel.de> wrote:
> I'd assume it's extremely rare for there to be this many items on a page.
> I'd guess that something like having BTScanPosData->items point to an
> in-line BTScanPosItem items_inline[N] (with N somewhere around 4-16), and
> dynamically allocating a full-length BTScanPosItem[MaxTIDsPerBTreePage]
> just in the cases it's needed, would do the trick.

There can still only be MaxIndexTuplesPerPage items on the page (407 if
memory serves) -- deduplication didn't change that.

It isn't at all rare for the scan to have to return about 1350 TIDs from a
page, though.  Any low cardinality index will tend to have almost that many
TIDs to return on any page that only stores duplicates.  And the scan will
necessarily have to return all of the TIDs from such a page, if it has to
return any.

--
Peter Geoghegan

Andres Freund <andres@anarazel.de> writes:
> I'm a bit confused by the "MUST BE LAST" comment:
>         BTScanPosItem items[MaxTIDsPerBTreePage];       /* MUST BE LAST */

Yeah, me too.  It looks like it might've been intended to allow the items[]
array to be only partially allocated.  But as Peter says, we don't do that.
Too long ago to be sure :-(

			regards, tom lane

Hi,

On 2025-04-02 12:01:57 -0400, Peter Geoghegan wrote:
> On Wed, Apr 2, 2025 at 11:57 AM Andres Freund <andres@anarazel.de> wrote:
> > I'd assume it's extremely rare for there to be this many items on a page.
> > I'd guess that something like having BTScanPosData->items point to an
> > in-line BTScanPosItem items_inline[N] (with N somewhere around 4-16), and
> > dynamically allocating a full-length BTScanPosItem[MaxTIDsPerBTreePage]
> > just in the cases it's needed, would do the trick.
>
> There can still only be MaxIndexTuplesPerPage items on the page (407
> if memory serves) -- deduplication didn't change that.

Sure.

> It isn't at all rare for the scan to have to return about 1350 TIDs
> from a page, though.  Any low cardinality index will tend to have
> almost that many TIDs to return on any page that only stores
> duplicates.  And the scan will necessarily have to return all of the
> TIDs from such a page, if it has to return any.

I'm not sure what you're arguing for/against here?  Obviously we need to
handle that case.  I doubt that the overhead of allocating a
MaxTIDsPerBTreePage * sizeof(BTScanPosItem) array once per scan matters
when that many tuples are returned.

Greetings,

Andres Freund

On Wed, Apr 2, 2025 at 12:08 PM Andres Freund <andres@anarazel.de> wrote:
> > It isn't at all rare for the scan to have to return about 1350 TIDs
> > from a page, though.  Any low cardinality index will tend to have
> > almost that many TIDs to return on any page that only stores
> > duplicates.  And the scan will necessarily have to return all of the
> > TIDs from such a page, if it has to return any.
>
> I'm not sure what you're arguing for/against here?  Obviously we need to
> handle that case.  I doubt that the overhead of allocating a
> MaxTIDsPerBTreePage * sizeof(BTScanPosItem) array once per scan matters
> when that many tuples are returned.

I'm not arguing for or against anything.  I'm just giving context.

--
Peter Geoghegan

Peter Geoghegan <pg@bowt.ie> writes:
> It isn't at all rare for the scan to have to return about 1350 TIDs
> from a page, though.  Any low cardinality index will tend to have
> almost that many TIDs to return on any page that only stores
> duplicates.  And the scan will necessarily have to return all of the
> TIDs from such a page, if it has to return any.

Agreed, but scans that just return one item are also very common,
particularly in the syscache-miss case that Andres started with.

I could get behind the idea of just having enough space in BTScanOpaqueData
for about ten items, and dynamically allocating a MaxTIDsPerBTreePage-sized
array only if we overrun that.  And not allocating any space for
mark/restore unless a mark is done.

			regards, tom lane
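
A self-contained toy sketch of that scheme (made-up names, not a patch
against nbtree; MAX_TIDS_PER_PAGE stands in for MaxTIDsPerBTreePage at the
default BLCKSZ, and the inline size is the "about ten" suggested above):

#include <stdlib.h>

#define MAX_TIDS_PER_PAGE 1358   /* MaxTIDsPerBTreePage, default BLCKSZ */
#define INLINE_ITEMS      10     /* enough for the common case */

typedef struct ScanItem
{
    unsigned short indexOffset;  /* offset number of the index tuple */
} ScanItem;

typedef struct ScanPos
{
    int       nitems;                   /* number of valid entries */
    ScanItem *items;                    /* itemsbuf, or a heap allocation */
    ScanItem  itemsbuf[INLINE_ITEMS];   /* inline space, no malloc needed */
} ScanPos;

/* Point pos->items at storage large enough for nitems entries. */
static void
scanpos_reserve(ScanPos *pos, int nitems)
{
    /* the inline buffer covers the common case */
    if (nitems <= INLINE_ITEMS)
    {
        if (pos->items == NULL)
            pos->items = pos->itemsbuf;
        return;
    }

    /* overflow: switch to a full-sized array, allocated at most once
     * per scan (error handling omitted for brevity) */
    if (pos->items == NULL || pos->items == pos->itemsbuf)
        pos->items = malloc(MAX_TIDS_PER_PAGE * sizeof(ScanItem));
}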

On Wed, Apr 2, 2025 at 12:10 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I could get behind the idea of just having enough space in
> BTScanOpaqueData for about ten items, and dynamically allocating
> a MaxTIDsPerBTreePage-sized array only if we overrun that.
> And not allocating any space for mark/restore unless a mark is
> done.

That's exactly what I had in mind.

--
Peter Geoghegan

On 2025-Apr-02, Tom Lane wrote:

> Andres Freund <andres@anarazel.de> writes:
> > I'm a bit confused by the "MUST BE LAST" comment:
> >         BTScanPosItem items[MaxTIDsPerBTreePage];       /* MUST BE LAST */
>
> Yeah, me too.  It looks like it might've been intended to allow
> the items[] array to be only partially allocated.  But as Peter
> says, we don't do that.  Too long ago to be sure :-(

The mentioned commit (09cb5c0e7d6f) shows that we do partial memcpys, e.g.

+		memcpy(&so->markPos, &so->currPos,
+			   offsetof(BTScanPosData, items[1]) +
+			   so->currPos.lastItem * sizeof(BTScanPosItem));

That would obviously not work if this field weren't last.  We still do
that, don't we?  See btrestrpos().

--
Álvaro Herrera        PostgreSQL Developer  —  https://www.EnterpriseDB.com/
"Before you were born your parents weren't as boring as they are now.  They
got that way paying your bills, cleaning up your room and listening to you
tell them how idealistic you are."       -- Charles J. Sykes' advice to teenagers
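
A compact, self-contained illustration of why that partial copy requires the
array to be the last member (toy struct names, not the real nbtree
definitions):

#include <stddef.h>
#include <string.h>

typedef struct Item
{
    unsigned short offnum;
} Item;

typedef struct Pos
{
    int  lastItem;     /* index of the last valid entry in items[] */
    Item items[1358];  /* MUST BE LAST */
} Pos;

/*
 * Copy only the header plus the populated prefix of items[]:
 * offsetof(Pos, items[1]) covers everything up to and including items[0],
 * and lastItem * sizeof(Item) covers the remaining valid entries.  If any
 * member were declared after items[], this prefix copy would leave it
 * behind (or force copying the whole multi-kB array anyway), which is why
 * the array has to stay last.
 */
static void
save_mark(Pos *mark, const Pos *cur)
{
    memcpy(mark, cur,
           offsetof(Pos, items[1]) +
           cur->lastItem * sizeof(Item));
}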

On Wed, Apr 2, 2025 at 12:31 PM Álvaro Herrera <alvherre@alvh.no-ip.org> wrote:
> The mentioned commit (09cb5c0e7d6f) shows that we do partial memcpys, e.g.
>
> +		memcpy(&so->markPos, &so->currPos,
> +			   offsetof(BTScanPosData, items[1]) +
> +			   so->currPos.lastItem * sizeof(BTScanPosItem));
>
> That would obviously not work if this field weren't last.  We still do
> that, don't we?  See btrestrpos().

Yes, we do -- but it's rare.  It only happens in the case where we restore
a mark after leaving the page where the mark was taken.  In practice, merge
joins tend to hardly ever do that (I know this because it makes testing
annoyingly difficult).  In other words, the optimization added by commit
08ae5edc5c seems to be very effective in practice.

--
Peter Geoghegan

On 2025-Apr-02, Peter Geoghegan wrote:

> On Wed, Apr 2, 2025 at 12:31 PM Álvaro Herrera <alvherre@alvh.no-ip.org> wrote:
> > The mentioned commit (09cb5c0e7d6f) shows that we do partial memcpys, e.g.
> >
> > +		memcpy(&so->markPos, &so->currPos,
> > +			   offsetof(BTScanPosData, items[1]) +
> > +			   so->currPos.lastItem * sizeof(BTScanPosItem));
> >
> > That would obviously not work if this field weren't last.  We still do
> > that, don't we?  See btrestrpos().
>
> Yes, we do -- but it's rare.  It only happens in the case where we
> restore a mark after leaving the page where the mark was taken.  In
> practice, merge joins tend to hardly ever do that (I know this because
> it makes testing annoyingly difficult).  In other words, the
> optimization added by commit 08ae5edc5c seems to be very effective in
> practice.

I'm just saying that this is the reason to have the field be last in the
struct.

--
Álvaro Herrera        Breisgau, Deutschland  —  https://www.EnterpriseDB.com/
"Rebellion is the original virtue of man" (Arthur Schopenhauer)

On Wed, Apr 2, 2025 at 12:43 PM Álvaro Herrera <alvherre@alvh.no-ip.org> wrote:
> I'm just saying that this is the reason to have the field be last in the
> struct.

Right.  And I'm just saying that I don't think that it would take very much
effort to change it.  That aspect isn't performance critical.

--
Peter Geoghegan

On 4/2/25 17:45, Peter Geoghegan wrote:
> On Wed, Apr 2, 2025 at 11:36 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Ouch!  I had no idea it had gotten that big.  Yeah, we ought to
>> do something about that.
>
> Tomas Vondra talked about this recently, in the context of his work on
> prefetching.
>

I might have mentioned it in the context of index prefetching (because that
has to touch this, naturally), but I actually ran into it when working on
the fast-path locking [1].

[1] https://www.postgresql.org/message-id/510b887e-c0ce-4a0c-a17a-2c6abb8d9a5c@enterprisedb.com

One of the tests I did was with partitions, with index scans on tiny
partitions, and it got pretty awful simply because of the malloc() calls.
The struct exceeds ALLOCSET_SEPARATE_THRESHOLD, so it can't be cached by
the memory context, and even if it could, we would not cache it across
scans anyway.

>>> And/or perhaps we could allocate BTScanOpaqueData.markPos as a whole
>>> only when mark/restore are used?
>>
>> That'd be an easy way of removing about half of the problem, but
>> 14kB is still too much.  How badly do we need this items array?
>> Couldn't we just reference the on-page items?
>
> I'm not sure what you mean by that.  The whole design of _bt_readpage
> is based on the idea that we read a whole page, in one go.  It has to
> batch up the items that are to be returned from the page somewhere.
> The worst case is that there are about 1350 TIDs to return from any
> single page (assuming default BLCKSZ).  It's very pessimistic to start
> from the assumption that that worst case will be hit, but I don't see
> a way around doing it at least some of the time.
>
> The first thing I'd try is some kind of simple dynamic allocation
> scheme, with a small built-in array that avoided any allocation
> penalty in the common case where there weren't too many tuples to
> return from the page.
>
> The way that we allocate BLCKSZ twice for index-only scans (one for
> so->currTuples, the other for so->markTuples) is also pretty
> inefficient.  Especially because any kind of use of mark and restore is
> exceedingly rare.
>

Yeah, something like this (allocating smaller arrays unless more is actually
needed) would help many common cases.

Another thing that helped was setting the MALLOC_TOP_PAD_ environment
variable (or doing the same thing using mallopt), so that glibc keeps a
"buffer" for future allocations.

regards

--
Tomas Vondra
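
For reference, a minimal sketch of that glibc knob; the 1 MB value here is
just an example, not a recommendation from this thread:

#include <malloc.h>

int
main(void)
{
    /*
     * Ask glibc to keep extra slack at the top of the heap when it grows or
     * shrinks, so repeated large allocate/free cycles don't turn into
     * repeated sbrk()/mmap() traffic.  Equivalent to setting
     * MALLOC_TOP_PAD_=1048576 in the environment before starting the
     * process.
     */
    mallopt(M_TOP_PAD, 1024 * 1024);

    /* ... run the allocation-heavy workload here ... */
    return 0;
}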

On Thu, 3 Apr 2025 at 04:21, Andres Freund <andres@anarazel.de> wrote:
> I was mildly surprised to see how expensive the new compact attribute
> checks are.

Is this a fairly deform-heavy workload?

FWIW, before adding that Assert, I did test to see if there was any
measurable impact on the time it took to run all the tests [1], and I
couldn't measure it.

I feel having something to check these are in sync is worthwhile.  It did
find bugs during the development of the patch.

David

[1] https://postgr.es/m/CAApHDvpHmUuFY9rtRw6oiG7OTM-o%2BBnODg1m-EynKGbbhSsMzA%40mail.gmail.com

Hi,

On 2025-04-03 11:26:08 +1300, David Rowley wrote:
> On Thu, 3 Apr 2025 at 04:21, Andres Freund <andres@anarazel.de> wrote:
> > I was mildly surprised to see how expensive the new compact attribute
> > checks are.
>
> Is this a fairly deform-heavy workload?

It was our regression tests, although I was playing around with combining
various groups in the schedule to get the tests to pass faster overall (on
average we just use 2.8 cores, which seems depressingly low).

I do see about a 6-7% increase in test time with just the default schedule.

> I feel having something to check these are in sync is worthwhile.  It did
> find bugs during the development of the patch.

I agree.  But I wonder if we couldn't make it a bit smarter than it is right
now.  Afaict we'll recheck it over and over, which does seem a bit extreme.

When running queries that deform a lot, the overhead is rather substantial.
With a pgbench run of -f <(echo 'SELECT sum(pronargs) FROM pg_proc;') I see
2155 TPS with the call to verify_compact_attribute() commented out and 827
TPS with it in place.

Greetings,

Andres Freund
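
One possible shape for "a bit smarter", sketched purely as an illustration:
verify a descriptor's compact attributes once and remember that it passed,
instead of re-running the check on every access.  The "tdchecked" flag is an
invented field that would have to be added to TupleDescData and reset
whenever the descriptor is modified; only the verify_compact_attribute()
call itself is existing code.

/*
 * Hypothetical helper (not in PostgreSQL today): verify all compact
 * attributes of a TupleDesc once, then remember that the descriptor passed
 * so later calls are nearly free under assert-enabled builds.
 */
static inline void
verify_compact_attributes_once(TupleDesc tupdesc)
{
#ifdef USE_ASSERT_CHECKING
	if (tupdesc->tdchecked)		/* invented field, see above */
		return;

	for (int i = 0; i < tupdesc->natts; i++)
		verify_compact_attribute(tupdesc, i);

	tupdesc->tdchecked = true;
#endif
}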