Thread: Occasional 9.6.10 PMChildFlags fatal error, possibly due to >2 parallel gathers

Occasional 9.6.10 PMChildFlags fatal error, possibly due to >2 parallel gathers

From: Chris Snook
I'll start by saying that I don't have as much information as I'd like to make a proper bug report, but I want to throw this out there in case anyone else has seen something similar and can point me in the right direction for reproducing it.

I have a Debian Stretch system running the 9.6.10 build from the Debian APT repos. It has been running flawlessly, lightly loaded, for about a year and a half, only being restarted for security updates. It recently crashed twice, about an hour apart, due to the postmaster receiving a signal from a worker task, both times with this complaint in the log:

FATAL:  no free slots in PMChildFlags array

For more context, I got these tightly packed around the first crash, with the first and last messages repeated hundreds of times:

FATAL:  sorry, too many clients already
FATAL:  sorry, too many clients already
FATAL:  sorry, too many clients already
FATAL:  no free slots in PMChildFlags array
WARNING:  could not remove shared memory segment "/PostgreSQL.1407760088": No such file or directory
FATAL:  semop(id=2293786) failed: Invalid argument
FATAL:  semop(id=2293786) failed: Invalid argument
FATAL:  semctl(2064403, 7, SETVAL, 0) failed: Invalid argument
FATAL:  semop(id=2621476) failed: Invalid argument
FATAL:  semop(id=2621476) failed: Invalid argument
FATAL:  semctl(2293786, 1, SETVAL, 0) failed: Invalid argument
FATAL:  semctl(2621476, 10, SETVAL, 0) failed: Invalid argument
WARNING:  could not remove shared memory segment "/PostgreSQL.1621779631": No such file or directory
LOG:  database system is shut down
FATAL:  terminating connection due to unexpected postmaster exit
FATAL:  terminating connection due to unexpected postmaster exit
FATAL:  terminating connection due to unexpected postmaster exit

The second time, I got this:

FATAL:  no free slots in PMChildFlags array
FATAL:  semctl(3604480, 14, SETVAL, 0) failed: Invalid argument
FATAL:  semctl(3604480, 13, SETVAL, 0) failed: Invalid argument
FATAL:  semctl(3604480, 12, SETVAL, 0) failed: Invalid argument
FATAL:  semctl(3604480, 11, SETVAL, 0) failed: Invalid argument
FATAL:  semctl(3604480, 10, SETVAL, 0) failed: Invalid argument
LOG:  database system is shut down
FATAL:  terminating connection due to unexpected postmaster exit
FATAL:  terminating connection due to unexpected postmaster exit
FATAL:  terminating connection due to unexpected postmaster exit

followed, hundreds of "terminating connection" lines later, by this:

FATAL:  terminating connection due to unexpected postmaster exit
FATAL:  terminating connection due to unexpected postmaster exit
FATAL:  terminating connection due to unexpected postmaster exit
FATAL:  terminating connection due to administrator command
STATEMENT:  SELECT column1 FROM table1 WHERE (column2='value1') AND (column3->>'key1'='value2') ORDER BY created ASC LIMIT 1
WARNING:  could not remove shared memory segment "/PostgreSQL.1032869518": No such file or directory
FATAL:  postmaster exited during a parallel transaction
LOG:  database system was interrupted; last known up at 2019-02-12 20:13:55 GMT
FATAL:  the database system is starting up
FATAL:  the database system is starting up
FATAL:  the database system is starting up
FATAL:  the database system is starting up
FATAL:  the database system is starting up
FATAL:  the database system is starting up
FATAL:  the database system is starting up
FATAL:  the database system is starting up
FATAL:  the database system is starting up
FATAL:  the database system is starting up
FATAL:  the database system is starting up
LOG:  incomplete startup packet
FATAL:  the database system is starting up
FATAL:  the database system is starting up
FATAL:  the database system is starting up

and, hundreds of "database system is starting up" lines later, by this:

FATAL:  the database system is starting up
FATAL:  the database system is starting up
FATAL:  the database system is starting up
LOG:  redo starts at E/D1504EA0
FATAL:  the database system is starting up
FATAL:  the database system is starting up
FATAL:  the database system is starting up
FATAL:  the database system is starting up
FATAL:  the database system is starting up
FATAL:  the database system is starting up
FATAL:  the database system is starting up
FATAL:  the database system is starting up
FATAL:  the database system is starting up
FATAL:  the database system is starting up
FATAL:  the database system is starting up
LOG:  invalid record length at E/D15A21B0: wanted 24, got 0
LOG:  redo done at E/D15A2188
LOG:  last completed transaction was at log time 2019-02-12 20:15:58.259264+00
FATAL:  the database system is starting up
FATAL:  the database system is starting up
FATAL:  the database system is starting up

Possibly of interest: the last completed transaction was two minutes after the "last known up" time.

I asked the application developers using the cluster if they had recently made any changes, and they said they had not. A Google search for the PMChildFlags error message found nothing recent. Since it seemed to be related to parallel queries, I looked up that behavior, and found a post that indicates that it'll use 3 workers if the table size is greater than 72 MB and configuration allows. We had max_parallel_workers_per_gather set at 10 for performance testing (the workload in no way needed that), so that was a possibility. I checked the sizes of tables in the cluster, and found that the table that the crashing query was accessing is 80 MB, and no other table in the cluster is that large.
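
For reference, the sort of check I did looked roughly like this (table1, column1, column2, and value1 are just the sanitized identifiers from the logged statement, and the queries are illustrative rather than exactly what I ran):

    -- size of the table the crashing query was hitting
    SELECT pg_size_pretty(pg_relation_size('table1'));

    -- 9.6 defaults: min_parallel_relation_size = 8MB, and the planner adds a
    -- worker each time the table triples in size (8MB -> 1, 24MB -> 2, 72MB -> 3),
    -- capped by max_parallel_workers_per_gather
    SHOW min_parallel_relation_size;
    SHOW max_parallel_workers_per_gather;

    -- the plan shows how many workers the planner actually wants
    EXPLAIN SELECT column1 FROM table1
     WHERE column2 = 'value1'
     ORDER BY created ASC LIMIT 1;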

My suspicion is that there's some sort of race condition that requires at least 3 parallel gather workers to hit, and most people aren't hitting it because they're not setting it that high, and there's something specific about my workload that triggers it. I think we started hitting it shortly after an autovacuum updated the query planner stats to reflect that the table was larger than 72 MB, which is why we didn't hit it at 72 MB.
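
For the timeline part of that theory, this is the sort of thing I'd want to correlate against the crash times (again with table1 standing in for the real name):

    -- when autovacuum/autoanalyze last touched the table
    SELECT relname, last_autovacuum, last_autoanalyze
      FROM pg_stat_user_tables
     WHERE relname = 'table1';

    -- relpages/reltuples as last recorded by VACUUM/ANALYZE,
    -- assuming the default 8kB block size
    SELECT relname, relpages, reltuples,
           pg_size_pretty(relpages::bigint * 8192) AS pages_size
      FROM pg_class
     WHERE relname = 'table1';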

Unfortunately, we didn't have more detailed timestamped logs at the time to confirm this theory, and it hasn't happened a third time. We've lowered max_parallel_workers_per_gather to 2, since this is a production system, but we may be able to rig up a reproducer with some guidance.
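
For what it's worth, lowering the setting was just the obvious change, along these lines (it can also simply be edited in postgresql.conf; no restart is needed):

    ALTER SYSTEM SET max_parallel_workers_per_gather = 2;
    SELECT pg_reload_conf();
    SHOW max_parallel_workers_per_gather;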

- Chris
From: Thomas Munro

On Wed, Feb 13, 2019 at 3:41 PM Chris Snook <csnook@cloudflare.com> wrote:
> For more context, I got these tightly packed around the first crash, with the first and last messages repeated hundreds of times:
>
> FATAL:  sorry, too many clients already
> FATAL:  sorry, too many clients already
> FATAL:  sorry, too many clients already
> FATAL:  no free slots in PMChildFlags array
> WARNING:  could not remove shared memory segment "/PostgreSQL.1407760088": No such file or directory
> FATAL:  semop(id=2293786) failed: Invalid argument
> FATAL:  semop(id=2293786) failed: Invalid argument
> FATAL:  semctl(2064403, 7, SETVAL, 0) failed: Invalid argument
> FATAL:  semop(id=2621476) failed: Invalid argument
> FATAL:  semop(id=2621476) failed: Invalid argument
> FATAL:  semctl(2293786, 1, SETVAL, 0) failed: Invalid argument
> FATAL:  semctl(2621476, 10, SETVAL, 0) failed: Invalid argument
> WARNING:  could not remove shared memory segment "/PostgreSQL.1621779631": No such file or directory

Any chance you created a cronjob that runs as user "postgres" (or
whatever user the PostgreSQL cluster runs as), or logged in as that
user manually for some reason?  Systemd likes to blow away global IPC
resources associated with users when they log out.

https://www.postgresql.org/docs/11/kernel-resources.html#SYSTEMD-REMOVEIPC
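
If that does turn out to be the cause, the workaround described on that page is to turn the behaviour off in logind, something like:

    # /etc/systemd/logind.conf
    RemoveIPC=no

    # then restart logind (or reboot):
    #   systemctl restart systemd-logind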

-- 
Thomas Munro
http://www.enterprisedb.com


From: Chris Snook

There was an idle psql session running in screen, invoked as sudo -u postgres psql. Salt is also routinely running a bunch of configuration assertion checks as the postgres user, but those are not login sessions either, and they have been running sub-hourly for over a year without incident. Backups run from a replica, this failure happened on the primary, and it was not proximal to a backup run. Because we're using stock Debian Stretch packages, that user is a system user (UID 110, GID 114), so that behavior wouldn't apply in this case.

If we can figure out how to reproduce it reliably outside of production, I'll turn all the logging options up to 11 so we can determine whether the shared memory error immediately follows the fatal error in the same process, or is just a cleanup race as everything shuts down. We haven't had a recurrence with max_parallel_workers_per_gather set to 2, but we also went several hours with it still set to 10 after the two failures (which were 63 minutes apart) without a recurrence, so that doesn't mean much.
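
Concretely, I'm thinking of something along these lines before trying to reproduce (illustrative, not the final list), mainly so every log line carries a timestamp and the backend PID:

    ALTER SYSTEM SET log_line_prefix = '%m [%p] ';  -- ms-resolution timestamp + PID
    ALTER SYSTEM SET log_min_messages = 'debug2';
    ALTER SYSTEM SET log_connections = on;
    ALTER SYSTEM SET log_disconnections = on;
    SELECT pg_reload_conf();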

- Chris
