Re: Addressing buffer private reference count scalability issue - Mailing list pgsql-hackers

From Alexandre Felipe
Subject Re: Addressing buffer private reference count scalability issue
Date
Msg-id CAE8JnxP=oPCmZs70VQ6U=35uLceqJVBQz3qakPR79cSkt7HU-g@mail.gmail.com
In response to Re: Addressing buffer private reference count scalability issue  (Andres Freund <andres@anarazel.de>)
Responses Re: Addressing buffer private reference count scalability issue
List pgsql-hackers


On Wed, Mar 11, 2026 at 6:55 PM Andres Freund <andres@anarazel.de> wrote:
Nice numbers!

It'd be good to also evaluate in the context of queries, as such a focused
microbenchmark will have much higher L1/L2 cache hit ratios than workloads
that actually look at data 

I ran with the query you first used to illustrate the issue, and it was slower.
I went back and tried a strict copy of the original code, and even that was 3% slower.
Passing CFLAGS += -falign-functions=64 makes the degradation disappear.
Maybe LTO is what we need here?


It's not surprising it's worse, I just don't see how we could get away without
some mixing.

I will keep that for the future.

 
It's not at all crazy to have access patterns that just differ
in the higher bits of the buffer number (e.g. a nestloop join where the inner
side always needs a certain number of buffers). We need to be somewhat
resistant to that causing a lot of collisions (which will trigger a hashtable
growth cycle, without that necessarily fixing the issue!).

Yes, I understand that possibility. 
 
> I brought back the array, but I eliminated the linear search.

Why? In my benchmarks allowing vectorization helped a decent amount in real
queries, because it does away with all the branch misses.

I basically treated the array as a hash table with a weak hash function and
delegated collisions to simplehash.
In the common case the lookup is a simple array[buffer & mask] that succeeds
with a single branch; in ~3% of cases it misses and takes a second branch
into simplehash.
It would also eliminate the branches needed to check the 1-entry cache.
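To make the shape of that lookup concrete, here is a minimal sketch; the names (refcount_array, overflow_lookup, get_refcount_entry) are illustrative only, not what the patch uses:

```c
#include <stdint.h>
#include <stddef.h>

typedef int32_t Buffer;

typedef struct PrivateRefCountEntry
{
	Buffer	buffer;
	int32_t	refcount;
} PrivateRefCountEntry;

#define REFCOUNT_ARRAY_SIZE 32
#define REFCOUNT_ARRAY_MASK (REFCOUNT_ARRAY_SIZE - 1)

/* direct-mapped array, indexed by the low bits of the buffer number */
static PrivateRefCountEntry refcount_array[REFCOUNT_ARRAY_SIZE];

/* stand-in for the simplehash overflow table */
static PrivateRefCountEntry *
overflow_lookup(Buffer buffer)
{
	(void) buffer;
	return NULL;
}

static PrivateRefCountEntry *
get_refcount_entry(Buffer buffer)
{
	/* "weak hash": just the low bits of the buffer number */
	PrivateRefCountEntry *slot = &refcount_array[buffer & REFCOUNT_ARRAY_MASK];

	if (slot->buffer == buffer)
		return slot;			/* common case: hit with a single branch */

	/* collision or cold slot: second branch, delegate to the hash table */
	return overflow_lookup(buffer);
}
```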

But note that this was the last patch; if you like everything except that part,
it is just a matter of picking 02, 03 and 04.

> 1. USE_REFCOUNT_CACHE_ENTRY will enable the last entry cache.
>
> 2a. the dynamic array case
> REFCOUNT_ARRAY_MAX_SIZE!=REFCOUNT_ARRAY_INITIAL_SIZE
> will grow the array when it reaches a certain level of occupation.
> I have set the default occupation level to 86% so that, if enabled, for a
> random input it will grow when we have about 2*size pins in total.
> If we find a sequential pattern then it will grow without growing the hash
> table.
> For the array lookup I don't use a hash, so for small number of pins
> it will be very fast.

I doubt it makes sense to basically have two levels of hash tables.


> 2b. the static case
> REFCOUNT_ARRAY_MAX_SIZE==REFCOUNT_ARRAY_INITIAL_SIZE
> will use a static array, just as we had before and will not perform the
> linear
> search. It still has to read the size and do mask input.
>
> I tested the 4 variations and the winner is with the static array without
> the cache for the last entry.
> I increased the array size from 8 to 32, since you suggested before that
> that this could help. At that point it would have the tradeoff of a longer
> linear search, so it may help even more now.

Does your benchmark actually test repeated accesses to the same refcount? As
mentioned, those are very very common (access sequences like Pin, (Lock,
Unlock)+, Unpin are extremely common).

Yes, that is the case: a FIFO with one buffer. I didn't include the lock/unlock
tests here, though I did run them at some point.
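For reference, the 1-entry cache being discussed amounts to something like the sketch below (hypothetical names, not the patch's; slow_lookup stands in for the full array/hash lookup). A Pin, (Lock, Unlock)+, Unpin sequence hits the cached entry on every access after the first:

```c
#include <stdint.h>
#include <stddef.h>

typedef int32_t Buffer;

typedef struct PrivateRefCountEntry
{
	Buffer	buffer;
	int32_t	refcount;
} PrivateRefCountEntry;

#define N_ENTRIES 8
static PrivateRefCountEntry entries[N_ENTRIES];

/* remember the entry returned by the previous lookup */
static PrivateRefCountEntry *last_entry = NULL;

/* stand-in for the full array/hash lookup */
static PrivateRefCountEntry *
slow_lookup(Buffer buffer)
{
	for (int i = 0; i < N_ENTRIES; i++)
		if (entries[i].buffer == buffer)
			return &entries[i];
	return NULL;
}

static PrivateRefCountEntry *
get_entry_cached(Buffer buffer)
{
	/*
	 * Pin, (Lock, Unlock)+, Unpin all touch the same buffer, so this
	 * single compare short-circuits the full lookup most of the time.
	 */
	if (last_entry != NULL && last_entry->buffer == buffer)
		return last_entry;

	last_entry = slow_lookup(buffer);
	return last_entry;
}
```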

I don't think the new naming scheme (GetSharedBufferEntry etc) is good, as
that does not *at all* communicate that this is backend private state.

I suspected that. GetEntryForPrivateReferenceCountOfSharedBuffer would be
more accurate, right?
I will probably stick with the original names.
 
I'd strongly advise separating moving code from a large scale rename.  I
certainly won't waste time trying to see the difference between what was just
moved and what was changed at the same time.
 
Would you prefer not moving the code at all? One of the main reasons
for it was the data structure changes in patch 04, which I will not
include in the next version.

 
> diff --git a/.gitignore b/.gitignore
> index 4e911395fe..fddb7f861d 100644
> --- a/.gitignore
> +++ b/.gitignore
> @@ -43,3 +43,5 @@ lib*.pc
>  /Release/
>  /tmp_install/
>  /portlock/
> +
> +.*
> \ No newline at end of file

What? Certainly not.

Do you mean we should certainly not exclude hidden files from git?
I usually build with the prefix postgresql/.build/patch-*/,
so whenever I check something out I have to keep re-adding this.

I don't think it's a good idea to introduce new simplehash infrastructure as
part of this larger change.
 
Do you think it is worth doing that as a separate patch? That would get it
out of the way, since this one will probably go through a few more versions.

You also haven't documented the new stuff.
 
Do you mean as source comments, or is there separate documentation
for this?
 
  
> The previous implementation used an 8-bytes (64-bit) entry to store
> a uint32 count and an uint32 lock mode. That is fine when we store
> the data separate from the key (buffer). But in the simplehash
> {key, value} are stored together, so each entry is 12-bytes.
> This is somewhat awkward as we have to either pad the entry to 16-bytes,
> or the accesses will alternate between aligned and misaligned addresses.
>
> Lock can assume only 4 values, and 2^30 is a decent limit for the
> number of pins on a single buffer. So this change is packing the
> {count[31:2], lock[1:0]} into a single uint32.
>
> Incrementing/decrementing the count continues as before, just using
> 4 instead of 1, lock mode access will require one or two additional
> bitwise operations. The exact count requires one shift, and is used
> only for debugging. A special function is provided to check whether
> count == 1.

Have you actually evaluated the benefit from this? Pretty sceptical it's worth
it.

I tested, and I agree: not worth it from a speed perspective.
At this point the only part left is the introduction of simplehash.

However, what I will try is storing just the buffer number in the hash
and keeping the entries in a separate array; who knows, that may work better.
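
For the record, the packing described in the quoted message boils down to a few one-liners. This is a sketch with made-up names (REF_LOCK_MASK, ref_count_is_one, etc.), not the patch code: the low two bits hold the lock mode, the remaining 30 bits hold the pin count:

```c
#include <stdint.h>

/* pack {count[31:2], lock[1:0]} into a single uint32 */
#define REF_LOCK_MASK  ((uint32_t) 0x3)
#define REF_COUNT_ONE  ((uint32_t) 1 << 2)	/* step for inc/dec of the count */

static inline uint32_t
ref_get_count(uint32_t packed)
{
	return packed >> 2;			/* exact count: one shift, debug use only */
}

static inline uint32_t
ref_get_lock(uint32_t packed)
{
	return packed & REF_LOCK_MASK;
}

static inline int
ref_count_is_one(uint32_t packed)
{
	/* count == 1 without shifting: clear the lock bits and compare */
	return (packed & ~REF_LOCK_MASK) == REF_COUNT_ONE;
}
```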


