Re: MultiXact\SLRU buffers configuration - Mailing list pgsql-hackers

From Thomas Munro
Subject Re: MultiXact\SLRU buffers configuration
Msg-id CA+hUKGKEdeVKpOqBxzRF+Z4=j0T+CA7ERrXni5De71Mm6-dBWA@mail.gmail.com
In response to Re: MultiXact\SLRU buffers configuration  (Andrey Borodin <x4mmm@yandex-team.ru>)
On Thu, Apr 1, 2021 at 10:09 AM Andrey Borodin <x4mmm@yandex-team.ru> wrote:
> > On 29 March 2021, at 11:26, Andrey Borodin <x4mmm@yandex-team.ru> wrote:
> > My TODO list:
> > 1. Try to break patch set v13-[0001-0004]
> > 2. Think how to measure performance of linear search versus hash search in SLRU buffer mapping.
>
> Hi Thomas!
> I'm still doing my homework. And to this moment all my catch is that "utils/dynahash.h" is not necessary.

Thanks.  Here's a new patch with that fixed, and also:

1.  New names ("... Mapping Table" -> "... Lookup Table" in
pg_shmem_allocations view, "clog_buffers" -> "xact_buffers") and a
couple of typos fixed here and there.

2.  Remove the cap of 128 buffers for xact_buffers as agreed.  We
still need a cap though, to avoid a couple of kinds of overflow inside
slru.c, both when computing the default value and accepting a
user-provided number.  I introduced SLRU_MAX_ALLOWED_BUFFERS to keep
it <= 1GB, and tested this on a 32 bit build with extreme block sizes.
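For illustration, the cap amounts to something like this (a sketch; the
exact expression in the patch may differ slightly):

    /*
     * Sketch: cap any SLRU at 1GB worth of buffers so that the buffer-count
     * and size arithmetic in slru.c can't overflow an int, even with unusual
     * BLCKSZ settings.
     */
    #define SLRU_MAX_ALLOWED_BUFFERS ((1024 * 1024 * 1024) / BLCKSZ)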

Likewise, I removed the cap of 16 buffers for commit_ts_buffers, but
only if you have track_commit_timestamp enabled.  It seems silly to
waste 1MB per 1GB of shared_buffers on a feature you're not using.  So
the default is capped at 16 in that case to preserve existing
behaviour, but otherwise can be set very large if you want.
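In rough code form (names approximate, not necessarily exactly what the
patch does):

    /* Sketch of the commit_ts_buffers default computation. */
    if (commit_ts_buffers > 0)
        return Max(4, commit_ts_buffers);
    if (track_commit_timestamp)
        return Min(SLRU_MAX_ALLOWED_BUFFERS, Max(4, NBuffers / 1024));
    return Min(16, Max(4, NBuffers / 1024));    /* feature disabled: keep old cap */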

I think it's plausible that we'll want to make the multixact sizes
adaptive too, but that might have to be a job for later.  Likewise,
I am sure that subtransaction-heavy workloads might be slower than
they need to be due to the current default, but I have not done the
research.  With these new GUCs, people are free to experiment and
develop theories about what the defaults should be in v15.

3.  In the special case of xact_buffers, there is a lower cap of
512MB, because there's no point trying to cache more xids than there
can be in existence, and that is computed by working backwards from
CLOG_XACTS_PER_PAGE etc.  It's not possible to do the same sort of
thing for the other SLRUs without overflow problems, and it doesn't
seem worth trying to fix that right now (1GB of cached commit
timestamps ought to be enough for anyone™, while the theoretical
maximum is 10 bytes for 2b xids = 20GB).
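Working that 512MB figure backwards with the standard 8kB BLCKSZ:

    CLOG_BITS_PER_XACT  = 2      (so CLOG_XACTS_PER_BYTE = 4)
    CLOG_XACTS_PER_PAGE = BLCKSZ * 4 = 32768
    2^31 xids / 32768 xids per page = 65536 pages
    65536 pages * 8kB = 512MB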

To make this more explicit for people not following our discussion in
detail: with shared_buffers=0...512MB, the behaviour doesn't change.
But for shared_buffers=1GB you'll get twice as many xact_buffers as
today (2MB instead of being capped at 1MB), and it keeps scaling
linearly from there at 0.2%.  In other words, all real world databases
will get a boost in this department.
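Spelled out for the 1GB case (using the NBuffers / 512 default shown
further down):

    shared_buffers = 1GB  =>  NBuffers = 1GB / 8kB = 131072
    xact_buffers = NBuffers / 512 = 256 buffers = 256 * 8kB = 2MB  (~0.2%)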

4.  Change the default for commit_ts_buffers back to shared_buffers /
1024 (with a minimum of 4), because I think you might have changed it
by a copy and paste error -- or did you intend to make the default
higher?

5.  Improve descriptions for the GUCs, visible in pg_settings view, to
match the documentation for related wait events.  So "for commit log
SLRU" -> "for the transaction status SLRU cache", and similar
corrections elsewhere.  (I am tempted to try to find a better word
than "SLRU", which doesn't seem relevant to users, but for now
consistency is good.)

6.  Added a callback so that SHOW xact_buffers and SHOW
commit_ts_buffers display the real size in effect (instead of "0" for
default).
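That's just the usual GUC show_hook pattern, roughly like this (a sketch;
the function and buffer names here are illustrative):

    static char xact_buffers_str[32];

    const char *
    show_xact_buffers(void)
    {
        /* Report the effective number of buffers, not the raw GUC value. */
        snprintf(xact_buffers_str, sizeof(xact_buffers_str),
                 "%d", CLOGShmemBuffers());
        return xact_buffers_str;
    }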

I tried running with xact_buffers=1 and soon saw why you changed it to
interpret 1 the same as 0; with 1 you hit buffer starvation and get
stuck.  I wish there were some way to say "the default for this GUC is
0, but if it's not zero then it's got to be at least 4".  I didn't
study the theoretical basis for the previous minimum value of 4, so I
think we should keep it that way, so that if you say 3 you get 4.  I
thought it was better to express that like so:

    /* Use configured value if provided. */
    if (xact_buffers > 0)
        return Max(4, xact_buffers);
    return Min(CLOG_MAX_ALLOWED_BUFFERS, Max(4, NBuffers / 512));

> I'm thinking about hashtables and measuring performance near optimum of linear search does not seem a good idea now.
> It's impossible to prove that difference is statistically significant on all platforms. But even on one platform
> measurements are just too noisy.
>
> Shared buffers lookup table is indeed very similar to this SLRU lookup table. And it does not try to use more memory
> than needed. I could not find pgbench-visible impact of growing shared buffer lookup table. Obviously, because it's
> not a bottleneck on regular workload. And it's hard to guess representative pathological workload.

Thanks for testing.  I agree that it's a good idea to follow the main
buffer pool's approach for our first version of this.  One small
difference is that the SLRU patch performs the hash computation while
it holds the lock.  If we computed the hash first and used
hash_search_with_hash_value(), we could compute it before we obtain
the lock, like the main buffer pool.
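For concreteness, the bufmgr-style order of operations would look
something like this (field and variable names here are illustrative, not
exactly what the patch uses):

    uint32      hash;
    bool        found;
    SlruMappingTableEntry *entry;

    /* Compute the hash before acquiring the lock, as bufmgr.c does. */
    hash = get_hash_value(shared->mapping_table, &pageno);

    LWLockAcquire(shared->ControlLock, LW_EXCLUSIVE);
    entry = hash_search_with_hash_value(shared->mapping_table, &pageno,
                                        hash, HASH_FIND, &found);
    /* ... use entry->slotno while still holding the lock ... */
    LWLockRelease(shared->ControlLock);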

If we computed the hash value first, we could also ignore the rule in
the documentation for hash_search_with_hash_value() that says that you
must calculate it with get_hash_value(), and just call the hash
function ourselves, so that it's fully inlinable.  The same
opportunity exists for the main buffer pool.  That'd get you one of
the micro-optimisations that simplehash.h offers.  Whether it's worth
bothering with, I don't know.

> In fact, this thread started with proposal to use reader-writer lock for multis (instead of exclusive lock), and this
> proposal encountered same problem. It's very hard to create stable reproduction of pathological workload when this
> lock is heavily contended. Many people observed the problem, but still there is no open repro.
>
> I bet it will be hard to prove that simplehash is any better than HTAB. But if it is really better, shared buffers
> could benefit from the same technique.

Agreed, there are a lot of interesting future projects in this area,
when you compare the main buffer pool, these special buffer pools, and
maybe also the "shared relfilenode" pool patch I have proposed for v15
(CF entry 2933).  All have mapping tables and buffer replacement
algorithms (why should they be different?), one has partitions, some
have atomic-based header locks, they interlock with WAL differently
(on page LSN, FPIs and checksum support), ... etc etc.

> I think it's just fine to use HTAB with normal size, as long as shared buffers do so. But there we allocate slightly
> more space: InitBufTable(NBuffers + NUM_BUFFER_PARTITIONS). Don't we need to allocate nslots + 1? It seems that we
> always do SlruMappingRemove() before SlruMappingAdd() and it is not necessary.

Yeah, we never try to add more elements than allowed, because we have
a big lock around the mapping.  The main buffer mapping table has a
more concurrent design and might temporarily have one extra entry per
partition.
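To spell out the comparison (the SLRU call below is a sketch with made-up
names, not the patch's exact code):

    /* Main buffer pool: concurrent partitions may transiently hold one
     * extra entry each, hence the headroom. */
    InitBufTable(NBuffers + NUM_BUFFER_PARTITIONS);

    /* SLRU mapping table: one big lock, and SlruMappingRemove() always runs
     * before SlruMappingAdd(), so exactly nslots entries suffice. */
    mapping_table = ShmemInitHash("xact SLRU Lookup Table",
                                  nslots, nslots, &info,
                                  HASH_ELEM | HASH_BLOBS);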
