`pg_ctl init` crashes when run concurrently; semget(2) suspected - Mailing list pgsql-hackers

From Gavin Panella
Subject `pg_ctl init` crashes when run concurrently; semget(2) suspected
Msg-id CALL7chmzY3eXHA7zHnODUVGZLSvK3wYCSP0RmcDFHJY8f28Q3g@mail.gmail.com
List pgsql-hackers
Hi,

Summary: semget(2) behaves differently on macOS and requires extra care.

I have many tests which spin up clusters using `pg_ctl init`, each in its own single-use temporary directory. Each test is run for every PostgreSQL installation found on the host machine. These tests are often run concurrently. Since adding PostgreSQL 17 to the mix, I've been getting sporadic failures on macOS:

  FATAL:  could not create semaphores: Invalid argument
  DETAIL:  Failed system call was semget(176163502, 20, 03600).
  child process exited with exit code 1

I think it's related to the increase of SEMAS_PER_SET in 38da053463bef32adf563ddee5277d16d2b6c5a (later reverted in 810a8b1c8051d4e8822967a96f133692698386de) combined with the behaviour of semget(2) on macOS.

I think the bug manifests because:
  • I create two clusters concurrently using `pg_ctl init`. One cluster is PostgreSQL 17; the other is PostgreSQL 16 or earlier.
  • Their data directories are separate but created close enough in time to have sequential inodes. This is relevant because the inode is used to seed the semaphore key.
  • Somehow (waves hands) semget(2) in PostgreSQL 17 is called with a key that points at a preexisting semaphore set. On Linux, because of the IPC_CREAT | IPC_EXCL flags, this returns -1 and sets errno to EEXIST. On macOS, it sets errno to EINVAL instead, likely because the requested number of semaphores is greater than the number in the existing set. This happens in the InternalIpcSemaphoreCreate function, which then aborts the process.
The attached patch fixes the issue, I think, and has another description of this mechanism. On EINVAL, it makes one more call to semget(2), this time requesting zero semaphores.
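
For illustration only, the idea can be sketched in C roughly as below. This is not the attached patch and not the PostgreSQL source: the function names, the result codes, and the retry limit are all made up for the example, and treating a successful zero-semaphore semget(2) as "this key is already in use" is an assumption that follows the hypothesis described above.

```c
#include <assert.h>
#include <errno.h>
#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/sem.h>

/* Owner read/write. IPC_CREAT | IPC_EXCL | 0600 typically prints as 03600. */
#define SKETCH_SEM_MODE 0600

/* Hypothetical result codes for this sketch (not PostgreSQL's). */
enum { SEM_KEY_IN_USE = -1, SEM_FAILED = -2 };

/*
 * Try to create a brand-new set of nsems semaphores under key.
 * Returns the set id on success, SEM_KEY_IN_USE if a set already exists
 * under that key (the caller should try another key), or SEM_FAILED for
 * any other error, with errno describing it.
 */
static int
try_create_sem_set(key_t key, int nsems)
{
    int semid;

    assert(nsems > 0);

    semid = semget(key, nsems, IPC_CREAT | IPC_EXCL | SKETCH_SEM_MODE);
    if (semid >= 0)
        return semid;

    /* The usual collision signal, e.g. on Linux. */
    if (errno == EEXIST)
        return SEM_KEY_IN_USE;

    if (errno == EINVAL)
    {
        int saved_errno = errno;

        /*
         * EINVAL is ambiguous. Probe with nsems = 0 and no IPC_CREAT:
         * that can only succeed if a set already exists under this key,
         * in which case the EINVAL is treated as a collision.
         */
        if (semget(key, 0, 0) >= 0)
            return SEM_KEY_IN_USE;

        errno = saved_errno;    /* report the original failure */
    }

    return SEM_FAILED;
}

/*
 * Walk successive keys starting at *next_key until a set is created or a
 * non-collision error occurs. *next_key is advanced past each key tried.
 */
static int
create_sem_set(key_t *next_key, int nsems)
{
    int attempt;

    for (attempt = 0; attempt < 1000; attempt++)
    {
        int semid = try_create_sem_set((*next_key)++, nsems);

        if (semid != SEM_KEY_IN_USE)
            return semid;
    }
    return SEM_FAILED;
}
```

With something like this, a second caller that lands on an occupied key while asking for more semaphores than the existing set holds would move on to the next key on both Linux and macOS, rather than treating EINVAL as fatal.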

The patch is relative to master, but I developed it against REL_17_5; it should apply cleanly to both. I think it would be good to backport a fix to 17 too.

If anyone is feeling nerd-sniped, some better proof of the "waves hands" bit would be useful, because that was a working hypothesis that led to a working fix, and I have not yet had time to investigate further.

Please consider the patch for review. Thanks!

Gavin.
