Many queries stuck for waiting MultiXactOffsetControlLock on PostgreSQL 9.6.1 replica - Mailing list pgsql-bugs

From Rui An
Subject Many queries stuck for waiting MultiXactOffsetControlLock on PostgreSQL 9.6.1 replica
Date
Msg-id CABjTVMR39ci6LUVE2yWhpcufPee4nO3cu32sij3Yh8viksCztQ@mail.gmail.com
Whole thread Raw
Responses Re: Many queries stuck for waiting MultiXactOffsetControlLock on PostgreSQL 9.6.1 replica  (Andrey Borodin <x4mmm@yandex-team.ru>)
Re: Many queries stuck for waiting MultiXactOffsetControlLock on PostgreSQL 9.6.1 replica  ("David G. Johnston" <david.g.johnston@gmail.com>)
List pgsql-bugs
Hi, I have recently encountered the problem, as title, on some of my PostgreSQL 9.6.1 hot standby replicas when serving readonly queries.

The problem makes a lot of PostgreSQL processes stuck on waiting for MultiXactOffsetControlLock and eventually runs out of the max_connections quota.

For example:
mydb=# select count(*) from pg_stat_activity where state = 'active' and wait_event = 'MultiXactOffsetControlLock';
 count
------
    956
(1 row)

I have tried use pg_terminate_backend to kill the stuck prcoesses, but no help. They still stuck on waiting for MultiXactOffsetControlLock even if the pg_terminate_backend(pid) returns true.

Currently, the only thing I can do to resolve the problem is to restart my replicas.

Can someone help me find out how could the problem occurred and is there any other way to resolve the problem?

I have also recorded some strace and perf top output on stuck processes:

strace:
strace: Process 555 attached
select(0, NULL, NULL, NULL, {0, 78}       = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 1000}   = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 1000}   = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 1000}   = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 1000}   = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 1000}   = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 1000}   = 0 (Timeout)
semop(60981276, [{0, -1, 0}, 1])               = 0
semop(60981276, [{0, -1, 0}, 1])               = 0
semop(61734963, [{14, -1, 0}, 1])             = 0
select(0, NULL, NULL, NULL, {0, 1000}   = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 1000}   = 0 (Timeout)
semop(61505580, [{12, -1, 0}, 1])             = 0
select(0, NULL, NULL, NULL, {0, 1000}   = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 1000}   = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 1000}   = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 1000}   = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 1000}   = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 1000}   = 0 (Timeout)
semop(60981276, [{0, -1, 0}, 1])               = 0
semop(63471720, [{9, -1, 0}, 1])               = 0
select(0, NULL, NULL, NULL, {0, 1000}   = 0 (Timeout)
semop(60981276, [{0, -1, 0}, 1])               = 0
semop(63471720, [{9, -1, 0}, 1])               = 0
select(0, NULL, NULL, NULL, {0, 1000}   = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 1000}   = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 1000}   = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 1000}   = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 1000}   = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 1000}   = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 1000}   = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 1000}   = 0 (Timeout)
semop(60981276, [{0, -1, 0}, 1])               = 0

perf top:
Samples: 208  of event 'cpu-clock', Event count (approx.): 36045672
Overhead  Shared Object   Symbol
    30.74%  postgres            [.] perform_spin_delay
    13.70%  [kernel]              [k] finish_task_switch
    13.66%  [kernel]              [k] 0x00007fff81838f35
      8.59%  postgres            [.] 0x0000000000319107
      4.02%  postgres            [.] 0x0000000000318f04
      2.47%  postgres            [.] pg_usleep
      2.10%  libc-2.23.so       [.] __select
      1.68%  postgres            [.] PGSemaphoreLock
      1.68%  postgres            [.] 0x0000000000318eb2
      1.54%  [kernel]              [k] pvclock_clocksource_read
      1.53%  postgres            [.] GetMultiXactIdMembers
      1.41%  postgres            [.] 0x0000000000318ea4
      1.31%  postgres            [.] LWLockAcquire
      1.30%  postgres            [.] 0x0000000000318e17
      1.23%  [kernel]              [k] do_select
      1.07%  postgres            [.] finish_spin_delay
      1.06%  [kernel]              [k] poll_freewait
      1.05%  libc-2.23.so       [.] semop
      0.96%  [kernel]              [k] poll_schedule_timeout
      0.72%  [kernel]              [k] SYSC_semtimedop
      0.69%  [kernel]              [k] core_sys_select
      0.69%  [kernel]              [k] pid_vnr
      0.61%  [kernel]              [k] dev_gro_receive
      0.61%  libc-2.23.so       [.] random
      0.61%  postgres            [.] 0x0000000000318e3d
      0.53%  [kernel]              [k] hrtimer_active
      0.53%  [kernel]              [k] ipc_obtain_object_check
      0.46%  postgres            [.] 0x0000000000318f00
      0.46%  postgres            [.] 0x0000000000319100
      0.46%  postgres            [.] 0x0000000000319103
      0.41%  postgres            [.] LWLockRelease
      0.41%  postgres            [.] 0x0000000000318f85
      0.36%  [kernel]              [k] run_timer_softirq
      0.36%  [kernel]              [k] sys_semop

pgsql-bugs by date:

Previous
From: Peter Geoghegan
Date:
Subject: Re: CREATE INDEX CONCURRENTLY does not index prepared xact's data
Next
From: Andrey Borodin
Date:
Subject: Re: Many queries stuck for waiting MultiXactOffsetControlLock on PostgreSQL 9.6.1 replica