Thread: Many queries stuck for waiting MultiXactOffsetControlLock on PostgreSQL 9.6.1 replica
Many queries stuck for waiting MultiXactOffsetControlLock on PostgreSQL 9.6.1 replica
From
Rui An
Date:
Hi, I have recently encountered the problem, as title, on some of my PostgreSQL 9.6.1 hot standby replicas when serving readonly queries.
The problem makes a lot of PostgreSQL processes stuck on waiting for MultiXactOffsetControlLock and eventually runs out of the max_connections quota.
For example:
mydb=# select count(*) from pg_stat_activity where state = 'active' and wait_event = 'MultiXactOffsetControlLock';
count
------
956
(1 row)
I have tried use pg_terminate_backend to kill the stuck prcoesses, but no help. They still stuck on waiting for MultiXactOffsetControlLock even if the pg_terminate_backend(pid) returns true.
Currently, the only thing I can do to resolve the problem is to restart my replicas.
Can someone help me find out how could the problem occurred and is there any other way to resolve the problem?
I have also recorded some strace and perf top output on stuck processes:
strace:
strace: Process 555 attached
select(0, NULL, NULL, NULL, {0, 78} = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 1000} = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 1000} = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 1000} = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 1000} = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 1000} = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 1000} = 0 (Timeout)
semop(60981276, [{0, -1, 0}, 1]) = 0
semop(60981276, [{0, -1, 0}, 1]) = 0
semop(61734963, [{14, -1, 0}, 1]) = 0
select(0, NULL, NULL, NULL, {0, 1000} = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 1000} = 0 (Timeout)
semop(61505580, [{12, -1, 0}, 1]) = 0
select(0, NULL, NULL, NULL, {0, 1000} = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 1000} = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 1000} = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 1000} = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 1000} = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 1000} = 0 (Timeout)
semop(60981276, [{0, -1, 0}, 1]) = 0
semop(63471720, [{9, -1, 0}, 1]) = 0
select(0, NULL, NULL, NULL, {0, 1000} = 0 (Timeout)
semop(60981276, [{0, -1, 0}, 1]) = 0
semop(63471720, [{9, -1, 0}, 1]) = 0
select(0, NULL, NULL, NULL, {0, 1000} = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 1000} = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 1000} = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 1000} = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 1000} = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 1000} = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 1000} = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 1000} = 0 (Timeout)
semop(60981276, [{0, -1, 0}, 1]) = 0
perf top:
Samples: 208 of event 'cpu-clock', Event count (approx.): 36045672
Overhead Shared Object Symbol
30.74% postgres [.] perform_spin_delay
13.70% [kernel] [k] finish_task_switch
13.66% [kernel] [k] 0x00007fff81838f35
8.59% postgres [.] 0x0000000000319107
4.02% postgres [.] 0x0000000000318f04
2.47% postgres [.] pg_usleep
2.10% libc-2.23.so [.] __select
1.68% postgres [.] PGSemaphoreLock
1.68% postgres [.] 0x0000000000318eb2
1.54% [kernel] [k] pvclock_clocksource_read
1.53% postgres [.] GetMultiXactIdMembers
1.41% postgres [.] 0x0000000000318ea4
1.31% postgres [.] LWLockAcquire
1.30% postgres [.] 0x0000000000318e17
1.23% [kernel] [k] do_select
1.07% postgres [.] finish_spin_delay
1.06% [kernel] [k] poll_freewait
1.05% libc-2.23.so [.] semop
0.96% [kernel] [k] poll_schedule_timeout
0.72% [kernel] [k] SYSC_semtimedop
0.69% [kernel] [k] core_sys_select
0.69% [kernel] [k] pid_vnr
0.61% [kernel] [k] dev_gro_receive
0.61% libc-2.23.so [.] random
0.61% postgres [.] 0x0000000000318e3d
0.53% [kernel] [k] hrtimer_active
0.53% [kernel] [k] ipc_obtain_object_check
0.46% postgres [.] 0x0000000000318f00
0.46% postgres [.] 0x0000000000319100
0.46% postgres [.] 0x0000000000319103
0.41% postgres [.] LWLockRelease
0.41% postgres [.] 0x0000000000318f85
0.36% [kernel] [k] run_timer_softirq
0.36% [kernel] [k] sys_semop
Re: Many queries stuck for waiting MultiXactOffsetControlLock on PostgreSQL 9.6.1 replica
From
Andrey Borodin
Date:
Hi! > 1 авг. 2021 г., в 00:30, Rui An <rueian.huang@gmail.com> написал(а): > > Hi, I have recently encountered the problem, as title, on some of my PostgreSQL 9.6.1 hot standby replicas when servingreadonly queries. > > The problem makes a lot of PostgreSQL processes stuck on waiting for MultiXactOffsetControlLock and eventually runs outof the max_connections quota. > > For example: > mydb=# select count(*) from pg_stat_activity where state = 'active' and wait_event = 'MultiXactOffsetControlLock'; > count > ------ > 956 > (1 row) > > I have tried use pg_terminate_backend to kill the stuck prcoesses, but no help. They still stuck on waiting for MultiXactOffsetControlLockeven if the pg_terminate_backend(pid) returns true. > > Currently, the only thing I can do to resolve the problem is to restart my replicas. > > Can someone help me find out how could the problem occurred and is there any other way to resolve the problem? > I highly recommend you to update PG to 9.6.22. Your system lacks almost 5 years of updates. This may or may not be relatedto the issue you observe, but still... I suspect that you observe effects of this sleep [0]. Kyotaro Horiguchi wrote draft patch to fix this [1], but I haven'tput enough efforts to make it commit-ready yet. I think that you can mitigate the problem by increasing size of MultixactOffset buffers here [2]. I hope in future this settingswill be configurable, thread [1] is just about this. Thanks! Best regards, Andrey Borodin. [0] https://github.com/postgres/postgres/blob/REL9_6_STABLE/src/backend/access/transam/multixact.c#L1289-L1319 [1] https://www.postgresql.org/message-id/flat/20200515.090333.24867479329066911.horikyota.ntt%40gmail.com#855f8bb7205890579a363d2344b4484d [2] https://github.com/postgres/postgres/blob/REL9_6_STABLE/src/include/access/multixact.h#L32-L33
Re: Many queries stuck for waiting MultiXactOffsetControlLock on PostgreSQL 9.6.1 replica
From
"David G. Johnston"
Date:
On Saturday, July 31, 2021, Rui An <rueian.huang@gmail.com> wrote:
I have tried use pg_terminate_backend to kill the stuck prcoesses, but no help. They still stuck on waiting for MultiXactOffsetControlLock even if the pg_terminate_backend(pid) returns true.
All the true is telling you is your process sent the requested signal. It, and related functions, provide no insight as to if/how receivers of said signal may have responded.
As for the question at hand - the axiom is “if it ain’t broke don’t fix it”…but as your system is basically broke you need to fix it by running a supported version of the software.
And you should also be prepared for the possibility that 9.6 is discontinued before there is a fix in which case you’d have to upgrade to v10 (at least) to get said fix.
David J.
Re: Many queries stuck for waiting MultiXactOffsetControlLock on PostgreSQL 9.6.1 replica
From
Tom Lane
Date:
"David G. Johnston" <david.g.johnston@gmail.com> writes: > And you should also be prepared for the possibility that 9.6 is > discontinued before there is a fix in which case you’d have to upgrade to > v10 (at least) to get said fix. If the fix is as complex as what's being discussed over at [1], there is zero chance that it'd be back-patched at all. Nonetheless, still being on 9.6.1 at this late date is next door to software malpractice. There are a *lot* of known bugs in that version that have been fixed, including performance issues [2]. regards, tom lane [1] https://www.postgresql.org/message-id/flat/2BEC2B3F-9B61-4C1D-9FB5-5FAB0F05EF86%40yandex-team.ru [2] https://www.postgresql.org/docs/release/