
Thread: Postmaster crashed during start

Postmaster crashed during start

From

Srinath Reddy

Date:

26 February, 06:22:13

Hi,
When we kill the postmaster using kill -9 and start it again immediately, it crashes with:

FATAL: pre-existing shared memory block (key 2495405, ID 360501) is still in use

HINT: Terminate any old server processes associated with data directory

We can reproduce this with:

kill -9 $(head -n 1 $PGDATA/postmaster.pid) & ./pg_ctl -D $PGDATA -l $PGDATA/logfile start

Reason for the crash:
When the postmaster is killed with signal 9, no cleanup runs, so the shared memory segment is not explicitly detached. The kernel does the detach itself when a process dies, which brings shm_nattch down to 0. However, if we try to start the postmaster again before the kernel has gotten around to that, then while creating postmaster.pid in CreateDataDirLockFile() the new postmaster checks whether the previous shmem segment is still in use. That check depends on shmStat.shm_nattch == 0 ? SHMSTATE_UNATTACHED : SHMSTATE_ATTACHED; since the kernel has not yet detached the dead process, shm_nattch is still 1, so the new postmaster thinks the shmem segment is in use and crashes.
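A minimal sketch of the kind of check described above, assuming a System V segment ID is already known. This is not the PostgreSQL source; the names seg_state and segment_state are hypothetical and only mirror the SHMSTATE_UNATTACHED / SHMSTATE_ATTACHED distinction quoted in the paragraph.

```c
#include <assert.h>
#include <sys/ipc.h>
#include <sys/shm.h>

/* Hypothetical names, loosely modeled on the states quoted above. */
typedef enum { SEG_UNATTACHED, SEG_ATTACHED, SEG_ERROR } seg_state;

/* Ask the kernel how many processes are currently attached to a segment. */
static seg_state segment_state(int shmid)
{
    struct shmid_ds st;

    assert(shmid >= 0);                 /* caller must pass a real segment ID */

    if (shmctl(shmid, IPC_STAT, &st) < 0)
        return SEG_ERROR;               /* e.g. the segment no longer exists */

    /* shm_nattch is the kernel's count of current attachments. */
    return st.shm_nattch == 0 ? SEG_UNATTACHED : SEG_ATTACHED;
}
```

The result is only a snapshot: the count can change right after shmctl() returns, which is the timing window this report is about.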

Should we even consider this a bug, or should we leave it as is, since it depends on how busy the kernel is? In this case it simply has not had time to clean up after the dead postmaster process, so it has not yet detached the segment and decremented shm_nattch.

thoughts?

Thanks and Regards

Srinath Reddy Sadipiralla
EDB: https://www.enterprisedb.com

Re: Postmaster crashed during start

From

Tom Lane

Date:

26 February, 06:53:01

Srinath Reddy <srinath2133@gmail.com> writes:
> when we kill postmaster using kill -9 and start immediately it crashes with
>> FATAL:  pre-existing shared memory block (key 2495405, ID 360501) is still
>> in use

"Doctor, it hurts when I do this!"

"So don't do that!"

This is not a supported way of shutting down the postmaster, and it
never will be.  Use SIGINT, or SIGQUIT if you are in a desperate
hurry and are willing to have the next startup take longer.

I think the specific reason you are seeing this is that it takes
nonzero time for the postmaster's orphaned child processes to
notice that the postmaster is gone and terminate.  As long as
any of those children remain, the shared memory block will have
a nonzero reference count.  The new postmaster sees that and
refuses to start, for the very sound reason that it risks
data corruption if it brings up a new set of worker processes
while any of the old ones are still running.

            regards, tom lane
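The point about orphaned children can be illustrated with a small standalone sketch. It assumes Linux-style System V shared memory, where a forked child inherits the parent's attachment and is counted in shm_nattch. The function names attach_count and orphan_demo are hypothetical, and nothing here is taken from the PostgreSQL source.

```c
#include <assert.h>
#include <sys/ipc.h>
#include <sys/shm.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Return the kernel's attach count for a segment, or -1 on error. */
static long attach_count(int shmid)
{
    struct shmid_ds st;
    return shmctl(shmid, IPC_STAT, &st) < 0 ? -1 : (long) st.shm_nattch;
}

/*
 * Hypothetical demo: a "parent" attaches, forks a "child", then detaches.
 * counts[0] = attach count right after fork,
 * counts[1] = attach count after the parent detaches,
 * counts[2] = attach count after the child has exited.
 * Returns 0 on success, -1 on failure.
 */
static int orphan_demo(long counts[3])
{
    int   gate[2];
    int   shmid;
    void *mem;
    pid_t pid;

    assert(counts != NULL);

    if (pipe(gate) < 0)
        return -1;
    shmid = shmget(IPC_PRIVATE, 4096, IPC_CREAT | 0600);
    if (shmid < 0)
        return -1;
    mem = shmat(shmid, NULL, 0);
    if (mem == (void *) -1) {
        shmctl(shmid, IPC_RMID, NULL);
        return -1;
    }

    pid = fork();
    if (pid < 0) {
        shmdt(mem);
        shmctl(shmid, IPC_RMID, NULL);
        return -1;
    }
    if (pid == 0) {
        char c;
        close(gate[1]);
        /* Block until the parent closes its end of the pipe. */
        if (read(gate[0], &c, 1) < 0)
            _exit(1);
        _exit(0);
    }

    close(gate[0]);
    counts[0] = attach_count(shmid);    /* parent and child both attached */
    shmdt(mem);                         /* the parent lets go ...          */
    counts[1] = attach_count(shmid);    /* ... but the child still counts  */
    close(gate[1]);                     /* release the child               */
    waitpid(pid, NULL, 0);
    counts[2] = attach_count(shmid);    /* nobody is attached any more     */
    shmctl(shmid, IPC_RMID, NULL);
    return 0;
}
```

In this sketch the count only reaches zero once the child is gone, which matches the explanation that the segment stays "in use" for as long as any old child process is still running.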

Re: Postmaster crashed during start

From

Srinath Reddy

Date:

26 February, 08:30:55

On Wed, Feb 26, 2025 at 9:50 AM Srinath Reddy <srinath2133@gmail.com> wrote:

On Wed, Feb 26, 2025 at 9:23 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Srinath Reddy <srinath2133@gmail.com> writes:
> > when we kill postmaster using kill -9 and start immediately it crashes with
> >> FATAL: pre-existing shared memory block (key 2495405, ID 360501) is still
> >> in use
>
> "Doctor, it hurts when I do this!"
>
> "So don't do that!"
>
> This is not a supported way of shutting down the postmaster, and it
> never will be. Use SIGINT, or SIGQUIT if you are in a desperate
> hurry and are willing to have the next startup take longer.

I was actually trying to recreate a power outage scenario using node->kill9() followed by node->start() in a custom TAP test; that is how I found this crash.

> I think the specific reason you are seeing this is that it takes
> nonzero time for the postmaster's orphaned child processes to
> notice that the postmaster is gone and terminate. As long as
> any of those children remain, the shared memory block will have
> a nonzero reference count. The new postmaster sees that and
> refuses to start, for the very sound reason that it risks
> data corruption if it brings up a new set of worker processes
> while any of the old ones are still running.
>
> regards, tom lane

I am guessing that by "reference count" of the shared memory block you mean shm_nattch, right? I think this is incremented by 1 when a process attaches to the shmem segment using shmat(). In the postgres case it is the postmaster that attaches, while creating the shmem segment, and it detaches when its on_shmem_exit callbacks run, which happens only if it exits properly and does not die suddenly (as is the case with kill -9). shm_nattch is decremented by 1 only on detach. AFAIK the child processes get to use the shmem segment but never attach or detach themselves, so they do not affect shm_nattch. So, since shm_nattch is not 0, PGSharedMemoryAttach thinks the shmem segment is still attached and in use.

Re: Postmaster crashed during start

From

Greg Sabino Mullane

Date:

26 February, 17:54:34

On Wed, Feb 26, 2025 at 12:31 AM Srinath Reddy <srinath2133@gmail.com> wrote:

> I was actually trying to recreate a power outage scenario using node->kill9() followed by node->start() in a custom TAP test; that is how I found this crash.

LOL, that's not a power outage test, that's a kill -9 postgres test. A true power outage would take care of any shared memory problems as well. Either carefully clear the shared memory as part of the test (you can find the key in postmaster.pid), or do a proper test with something like:

echo b > /proc/sysrq-trigger

> I am guessing that by "reference count" of the shared memory block you mean shm_nattch, right? I think this is incremented by 1 when a process attaches to the shmem segment using shmat(). In the postgres case it is the postmaster that attaches, while creating the shmem segment, and it detaches when its on_shmem_exit callbacks run, which happens only if it exits properly and does not die suddenly (as is the case with kill -9). shm_nattch is decremented by 1 only on detach. AFAIK the child processes get to use the shmem segment but never attach or detach themselves, so they do not affect shm_nattch. So, since shm_nattch is not 0, PGSharedMemoryAttach thinks the shmem segment is still attached and in use.

You might be overthinking this. A server crash is much more likely than a random postgres crash. Test the former, by all means. The latter is expected to have some potential manual cleanup, for safety reasons as explained above.

Cheers,

Greg

Crunchy Data - https://www.crunchydata.com

Enterprise Postgres Software Products & Tech Support