Re: Slow standby snapshot - Mailing list pgsql-hackers

From Michail Nikolaev
Subject Re: Slow standby snapshot
Msg-id CANtu0oh_ytfAgRYOSfQP49eFZv7qRFH+zdDB9=Bz0e7DQj5VUA@mail.gmail.com
In response to Re: Slow standby snapshot  (Kirill Reshke <reshkekirill@gmail.com>)
Responses Re: Slow standby snapshot
List pgsql-hackers
Hello.

> I recently ran into a problem in one of our production postgresql cluster.
> I had noticed lock contention on procarray lock on standby, which causes WAL
> replay lag growth.

Yes, I saw the same issue on my production cluster.

> 1) set max_connections to big number, like 100000

I ran the tests with a more realistic value: 5000. That is a valid value
for Amazon RDS, for example (the default there is
LEAST({DBInstanceClassMemory/9531392}, 5000)).

The test looks like this:

pgbench -i -s 10 -U postgres -d postgres
pgbench -b select-only -p 6543 -j 1 -c 50 -n -P 1 -T 18000 -U postgres postgres
pgbench -b simple-update -j 1 -c 50 -n -P 1 -T 18000 -U postgres postgres
a long transaction on the primary: begin; select txid_current();
perf top -p <pid of some standby>

On postgres 14 (master), the profile of the non-patched version looks like this:

   5.13%  postgres            [.] KnownAssignedXidsGetAndSetXmin
   4.61%  postgres            [.] pg_checksum_block
   2.54%  postgres            [.] AllocSetAlloc
   2.44%  postgres            [.] base_yyparse

Spending 5-6% of CPU just walking through an array is too much :) I think
it should be fixed for both the 13 and 14 versions.

The patched version looks like this (I could not find
KnownAssignedXidsGetAndSetXmin in the profile at all):

   3.08%  postgres            [.] pg_checksum_block
   2.89%  postgres            [.] AllocSetAlloc
   2.66%  postgres            [.] base_yyparse
   2.00%  postgres            [.] MemoryContextAllocZeroAligned

On postgres 13 the non-patched version looks even worse (it definitely
needs to be fixed, in my opinion):

  26.44%  postgres           [.] KnownAssignedXidsGetAndSetXmin
   2.17%  postgres            [.] base_yyparse
   2.01%  postgres            [.] AllocSetAlloc
   1.55%  postgres            [.] MemoryContextAllocZeroAligned

But your patch does not apply to REL_13_STABLE. Could you please
provide two versions?

Also, there are warnings when building with the patch:

        procarray.c:4595:9: warning: ISO C90 forbids mixed declarations and code [-Wdeclaration-after-statement]
        4595 |         int prv = -1;
             |         ^~~
        procarray.c: In function ‘KnownAssignedXidsGetOldestXmin’:
        procarray.c:5056:5: warning: variable ‘tail’ set but not used [-Wunused-but-set-variable]
        5056 |     tail;
             |     ^~~~
        procarray.c:5067:38: warning: ‘i’ is used uninitialized in this function [-Wuninitialized]
        5067 |         i = KnownAssignedXidsValidDLL[i].nxt;


Some of them are clear errors, so please recheck the code.
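
For illustration only, here is a small sketch of a loop over an
index-linked list written so that it avoids the three kinds of warnings
listed above. This is not the patch's code: the DemoNode type and
demo_count_from() are hypothetical names, and only the field names prv
and nxt echo the identifiers visible in the compiler output.

```c
/* Hypothetical node of an index-linked list (not the patch's struct). */
typedef struct
{
    int prv;    /* index of the previous node, or -1 */
    int nxt;    /* index of the next node, or -1 */
} DemoNode;

/* Count the nodes reachable from 'head' by following nxt links. */
static int
demo_count_from(const DemoNode *nodes, int head)
{
    /* C90 style: every declaration comes before the first statement. */
    int count = 0;
    int i = head;       /* initialized before it is ever read */

    while (i >= 0)
    {
        count++;
        i = nodes[i].nxt;
    }

    /* No variable is assigned without being used afterwards. */
    return count;
}
```

The point is only the shape: declarations at the top of the block, the
loop index initialized from a real starting position, and no leftover
variables that are set but never read.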

Also, maybe it is better to make the change less invasive by using a
simpler approach. For example, use the first bit to mark an xid as valid
and the last 7 bits (128 values) as an optimistic offset to the next
valid xid (jumping by at most 127 steps in the worst case).
What do you think?
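
To make the idea concrete, here is a minimal standalone sketch, not code
from PostgreSQL or from the patch. It assumes one byte per slot, with
the high bit as the "valid" flag and the low 7 bits as a skip hint; that
bit layout and all names (SLOT_VALID, slot_step, collect_valid, and so
on) are assumptions made for illustration. A hint of k on slot i is
taken to mean "slots i+1 .. i+k-1 are known to be invalid", and 0 means
"no hint yet".

```c
#include <assert.h>
#include <stdint.h>

#define SLOT_VALID   0x80u  /* assumption: high bit marks a valid xid */
#define SLOT_OFFSET  0x7Fu  /* assumption: low 7 bits hold the skip hint */
#define MAX_SKIP     127

static int
slot_is_valid(uint8_t slot)
{
    return (slot & SLOT_VALID) != 0;
}

/* A hint of 0 means "unknown" and is treated as a step of 1. */
static int
slot_step(uint8_t slot)
{
    int step = slot & SLOT_OFFSET;

    return step > 0 ? step : 1;
}

/* Store a skip hint (capped at MAX_SKIP), keeping the valid bit as is. */
static void
slot_set_step(uint8_t *slot, int dist)
{
    assert(dist >= 1);
    if (dist > MAX_SKIP)
        dist = MAX_SKIP;
    *slot = (uint8_t) ((*slot & SLOT_VALID) | (unsigned) dist);
}

/* Mark a slot invalid; its hint stays usable. */
static void
slot_remove(uint8_t *slot)
{
    *slot = (uint8_t) (*slot & SLOT_OFFSET);
}

/*
 * Write the indexes of all valid slots in [tail, head) to 'out' and
 * return how many were found. While scanning, cache on each visited
 * slot the distance to the next slot worth visiting.
 */
static int
collect_valid(uint8_t *slots, int tail, int head, int *out)
{
    int count = 0;
    int i = tail;

    while (i < head)
    {
        int anchor = i;
        int next;

        if (slot_is_valid(slots[i]))
            out[count++] = i;

        /* Follow existing hints, at most MAX_SKIP slots from the anchor. */
        next = i + slot_step(slots[i]);
        while (next < head && !slot_is_valid(slots[next]) &&
               next - anchor < MAX_SKIP)
            next += slot_step(slots[next]);
        if (next > head)
            next = head;

        slot_set_step(&slots[anchor], next - anchor);
        i = next;
    }
    return count;
}
```

The first scan over a long run of removed slots still walks them one by
one, but later scans jump in steps of up to 127. The hints are only safe
as long as a slot never goes from invalid back to valid in place, so they
would have to be reset whenever the array is compressed. The sketch also
ignores locking and memory ordering entirely, and a scan that writes
hints while holding only a shared lock would need separate thought.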

Also, it is a good idea to register the patch in the commitfest app
(https://commitfest.postgresql.org/).

Thanks,
Michail.


