Thread: Many queries stuck waiting for MultiXactOffsetControlLock on PostgreSQL 9.6.1 replica

Hi, I have recently encountered the problem described in the subject on some of my PostgreSQL 9.6.1 hot standby replicas while they were serving read-only queries.

The problem leaves a lot of PostgreSQL processes stuck waiting on MultiXactOffsetControlLock, and the server eventually runs out of its max_connections quota.

For example:
mydb=# select count(*) from pg_stat_activity where state = 'active' and wait_event = 'MultiXactOffsetControlLock';
 count
------
    956
(1 row)
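
For a broader picture, the waits can be grouped like this (wait_event_type and wait_event are the pg_stat_activity columns in 9.6):

-- group all active backends by what they are currently waiting on
select wait_event_type, wait_event, count(*)
from pg_stat_activity
where state = 'active'
group by wait_event_type, wait_event
order by count(*) desc;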

I have tried using pg_terminate_backend to kill the stuck processes, but it did not help. They are still stuck waiting on MultiXactOffsetControlLock even though pg_terminate_backend(pid) returns true.
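
(For reference, the termination attempt was roughly this, skipping my own session:)

-- signal every backend stuck on this wait; returns true for each pid signalled
select pg_terminate_backend(pid)
from pg_stat_activity
where wait_event = 'MultiXactOffsetControlLock'
  and pid <> pg_backend_pid();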

Currently, the only thing I can do to resolve the problem is to restart my replicas.

Can someone help me figure out how this problem could have occurred, and is there any other way to resolve it?

I have also recorded some strace and perf top output from the stuck processes:

strace:
strace: Process 555 attached
select(0, NULL, NULL, NULL, {0, 78})    = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 1000})  = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 1000})  = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 1000})  = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 1000})  = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 1000})  = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 1000})  = 0 (Timeout)
semop(60981276, [{0, -1, 0}, 1])        = 0
semop(60981276, [{0, -1, 0}, 1])        = 0
semop(61734963, [{14, -1, 0}, 1])       = 0
select(0, NULL, NULL, NULL, {0, 1000})  = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 1000})  = 0 (Timeout)
semop(61505580, [{12, -1, 0}, 1])       = 0
select(0, NULL, NULL, NULL, {0, 1000})  = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 1000})  = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 1000})  = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 1000})  = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 1000})  = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 1000})  = 0 (Timeout)
semop(60981276, [{0, -1, 0}, 1])        = 0
semop(63471720, [{9, -1, 0}, 1])        = 0
select(0, NULL, NULL, NULL, {0, 1000})  = 0 (Timeout)
semop(60981276, [{0, -1, 0}, 1])        = 0
semop(63471720, [{9, -1, 0}, 1])        = 0
select(0, NULL, NULL, NULL, {0, 1000})  = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 1000})  = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 1000})  = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 1000})  = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 1000})  = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 1000})  = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 1000})  = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 1000})  = 0 (Timeout)
semop(60981276, [{0, -1, 0}, 1])        = 0

perf top:
Samples: 208  of event 'cpu-clock', Event count (approx.): 36045672
Overhead  Shared Object   Symbol
  30.74%  postgres          [.] perform_spin_delay
  13.70%  [kernel]          [k] finish_task_switch
  13.66%  [kernel]          [k] 0x00007fff81838f35
   8.59%  postgres          [.] 0x0000000000319107
   4.02%  postgres          [.] 0x0000000000318f04
   2.47%  postgres          [.] pg_usleep
   2.10%  libc-2.23.so      [.] __select
   1.68%  postgres          [.] PGSemaphoreLock
   1.68%  postgres          [.] 0x0000000000318eb2
   1.54%  [kernel]          [k] pvclock_clocksource_read
   1.53%  postgres          [.] GetMultiXactIdMembers
   1.41%  postgres          [.] 0x0000000000318ea4
   1.31%  postgres          [.] LWLockAcquire
   1.30%  postgres          [.] 0x0000000000318e17
   1.23%  [kernel]          [k] do_select
   1.07%  postgres          [.] finish_spin_delay
   1.06%  [kernel]          [k] poll_freewait
   1.05%  libc-2.23.so      [.] semop
   0.96%  [kernel]          [k] poll_schedule_timeout
   0.72%  [kernel]          [k] SYSC_semtimedop
   0.69%  [kernel]          [k] core_sys_select
   0.69%  [kernel]          [k] pid_vnr
   0.61%  [kernel]          [k] dev_gro_receive
   0.61%  libc-2.23.so      [.] random
   0.61%  postgres          [.] 0x0000000000318e3d
   0.53%  [kernel]          [k] hrtimer_active
   0.53%  [kernel]          [k] ipc_obtain_object_check
   0.46%  postgres          [.] 0x0000000000318f00
   0.46%  postgres          [.] 0x0000000000319100
   0.46%  postgres          [.] 0x0000000000319103
   0.41%  postgres          [.] LWLockRelease
   0.41%  postgres          [.] 0x0000000000318f85
   0.36%  [kernel]          [k] run_timer_softirq
   0.36%  [kernel]          [k] sys_semop

Hi!

> On 1 Aug 2021, at 00:30, Rui An <rueian.huang@gmail.com> wrote:
>
> Hi, I have recently encountered the problem described in the subject on some of my PostgreSQL 9.6.1 hot standby replicas while they were serving read-only queries.
>
> The problem leaves a lot of PostgreSQL processes stuck waiting on MultiXactOffsetControlLock, and the server eventually runs out of its max_connections quota.
>
> For example:
> mydb=# select count(*) from pg_stat_activity where state = 'active' and wait_event = 'MultiXactOffsetControlLock';
>  count
> ------
>     956
> (1 row)
>
> I have tried using pg_terminate_backend to kill the stuck processes, but it did not help. They are still stuck waiting on MultiXactOffsetControlLock even though pg_terminate_backend(pid) returns true.
>
> Currently, the only thing I can do to resolve the problem is to restart my replicas.
>
> Can someone help me figure out how this problem could have occurred, and is there any other way to resolve it?
>

I highly recommend that you update PG to 9.6.22. Your system lacks almost 5 years of updates. This may or may not be related to the issue you observe, but still...

I suspect that you are observing the effects of this sleep [0]. Kyotaro Horiguchi wrote a draft patch to fix this [1], but I haven't put enough effort into making it commit-ready yet.
I think you can mitigate the problem by increasing the size of the MultiXactOffset buffers here [2]. I hope these settings will become configurable in the future; thread [1] is about exactly that.
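
As a side note, to get a rough idea of how many unfrozen multixacts the cluster is carrying (only a crude proxy for the range the offsets SLRU has to cover), something like this on the primary may help; mxid_age() exists since 9.5:

-- per-database span between the oldest unfrozen multixact and the current multixact counter
select datname, datminmxid, mxid_age(datminmxid) as multixact_span
from pg_database
order by 3 desc;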

Thanks!

Best regards, Andrey Borodin.

[0] https://github.com/postgres/postgres/blob/REL9_6_STABLE/src/backend/access/transam/multixact.c#L1289-L1319
[1] https://www.postgresql.org/message-id/flat/20200515.090333.24867479329066911.horikyota.ntt%40gmail.com#855f8bb7205890579a363d2344b4484d
[2] https://github.com/postgres/postgres/blob/REL9_6_STABLE/src/include/access/multixact.h#L32-L33


On Saturday, July 31, 2021, Rui An <rueian.huang@gmail.com> wrote:

> I have tried using pg_terminate_backend to kill the stuck processes, but it did not help. They are still stuck waiting on MultiXactOffsetControlLock even though pg_terminate_backend(pid) returns true.

All that "true" is telling you is that your process sent the requested signal. It, and related functions, provide no insight as to if or how the receivers of said signal may have responded.

As for the question at hand - the axiom is “if it ain’t broke, don’t fix it”… but as your system is basically broke, you need to fix it by running a supported version of the software.

And you should also be prepared for the possibility that 9.6 is discontinued before there is a fix in which case you’d have to upgrade to v10 (at least) to get said fix.

David J.

"David G. Johnston" <david.g.johnston@gmail.com> writes:
> And you should also be prepared for the possibility that 9.6 is
> discontinued before there is a fix in which case you’d have to upgrade to
> v10 (at least) to get said fix.

If the fix is as complex as what's being discussed over at [1], there
is zero chance that it'd be back-patched at all.

Nonetheless, still being on 9.6.1 at this late date is next door to
software malpractice.  There are a *lot* of known bugs in that version
that have been fixed, including performance issues [2].

            regards, tom lane

[1] https://www.postgresql.org/message-id/flat/2BEC2B3F-9B61-4C1D-9FB5-5FAB0F05EF86%40yandex-team.ru
[2] https://www.postgresql.org/docs/release/