Thread: Assertion failure on PG15 with modified test_shm_mq test

Assertion failure on PG15 with modified test_shm_mq test

From
Pavan Deolasee
Date:
Hello,

I've a slightly modified version of test_shm_mq, that I changed to include a shared fileset.  The motivation to do that came because I hit an assertion failure with PG15 while doing some development work on BDR and I suspected it to be a PG15 bug.

The stack trace looks as below:

(lldb) bt
* thread #1
  * frame #0: 0x00007ff8187b100e libsystem_kernel.dylib`__pthread_kill + 10
    frame #1: 0x00007ff8187e71ff libsystem_pthread.dylib`pthread_kill + 263
    frame #2: 0x00007ff818732d24 libsystem_c.dylib`abort + 123
    frame #3: 0x000000010fce1bab postgres`ExceptionalCondition(conditionName="pgstat_is_initialized && !pgstat_is_shutdown", errorType="FailedAssertion", fileName="pgstat.c", lineNumber=1227) at assert.c:69:2
    frame #4: 0x000000010fb06412 postgres`pgstat_assert_is_up at pgstat.c:1227:2
    frame #5: 0x000000010fb0d2c7 postgres`pgstat_get_entry_ref(kind=PGSTAT_KIND_DATABASE, dboid=0, objoid=0, create=true, created_entry=0x0000000000000000) at pgstat_shmem.c:406:2
    frame #6: 0x000000010fb07579 postgres`pgstat_prep_pending_entry(kind=PGSTAT_KIND_DATABASE, dboid=0, objoid=0, created_entry=0x0000000000000000) at pgstat.c:1068:14
    frame #7: 0x000000010fb09cce postgres`pgstat_prep_database_pending(dboid=0) at pgstat_database.c:327:14
    frame #8: 0x000000010fb09dff postgres`pgstat_report_tempfile(filesize=0) at pgstat_database.c:179:10
    frame #9: 0x000000010fa8dbe9 postgres`ReportTemporaryFileUsage(path="base/pgsql_tmp/pgsql_tmp17312.0.fileset/test_mq_sharefile.0", size=0) at fd.c:1521:2
    frame #10: 0x000000010fa8db3c postgres`PathNameDeleteTemporaryFile(path="base/pgsql_tmp/pgsql_tmp17312.0.fileset/test_mq_sharefile.0", error_on_failure=false) at fd.c:1945:3
    frame #11: 0x000000010fa8d3a8 postgres`unlink_if_exists_fname(fname="base/pgsql_tmp/pgsql_tmp17312.0.fileset/test_mq_sharefile.0", isdir=false, elevel=15) at fd.c:3674:3
    frame #12: 0x000000010fa8d270 postgres`walkdir(path="base/pgsql_tmp/pgsql_tmp17312.0.fileset", action=(postgres`unlink_if_exists_fname at fd.c:3663), process_symlinks=false, elevel=15) at fd.c:3573:5
    frame #13: 0x000000010fa8d0e2 postgres`PathNameDeleteTemporaryDir(dirname="base/pgsql_tmp/pgsql_tmp17312.0.fileset") at fd.c:1689:2
    frame #14: 0x000000010fa91ac1 postgres`FileSetDeleteAll(fileset=0x0000000119240870) at fileset.c:165:3
    frame #15: 0x000000010fa92b08 postgres`SharedFileSetOnDetach(segment=0x00007f93ff00a7c0, datum=4716759152) at sharedfileset.c:119:3
    frame #16: 0x000000010fa96b00 postgres`dsm_detach(seg=0x00007f93ff00a7c0) at dsm.c:801:3
    frame #17: 0x000000010fa96f51 postgres`dsm_backend_shutdown at dsm.c:738:3
    frame #18: 0x000000010fa99402 postgres`shmem_exit(code=1) at ipc.c:259:2
    frame #19: 0x000000010fa99227 postgres`proc_exit_prepare(code=1) at ipc.c:194:2
    frame #20: 0x000000010fa99133 postgres`proc_exit(code=1) at ipc.c:107:2
    frame #21: 0x000000010fce318c postgres`errfinish(filename="postgres.c", lineno=3204, funcname="ProcessInterrupts") at elog.c:661:3
    frame #22: 0x000000010fad7c51 postgres`ProcessInterrupts at postgres.c:3201:4
    frame #23: 0x000000011924d85b test_shm_mq.so`test_shm_mq_main(main_arg=1155036180) at worker.c:159:2
    frame #24: 0x000000010f9da11e postgres`StartBackgroundWorker at bgworker.c:858:2
    frame #25: 0x000000010f9e80b4 postgres`do_start_bgworker(rw=0x00007f93ef904080) at postmaster.c:5823:4
    frame #26: 0x000000010f9e2524 postgres`maybe_start_bgworkers at postmaster.c:6047:9
    frame #27: 0x000000010f9e0e63 postgres`sigusr1_handler(postgres_signal_arg=30) at postmaster.c:5204:3
    frame #28: 0x00007ff8187fcdfd libsystem_platform.dylib`_sigtramp + 29
    frame #29: 0x00007ff8187b2d5b libsystem_kernel.dylib`__select + 11
    frame #30: 0x000000010f9e268d postgres`ServerLoop at postmaster.c:1770:13
    frame #31: 0x000000010f9e0157 postgres`PostmasterMain(argc=8, argv=0x0000600002f30190) at postmaster.c:1478:11
    frame #32: 0x000000010f8bc930 postgres`main(argc=8, argv=0x0000600002f30190) at main.c:202:3
    frame #33: 0x000000011f7d651e dyld`start + 462

I notice that pgstat_shutdown_hook() is registered as a before-shmem-exit callback. The callback is responsible for detaching from the pgstat shared memory segment. But looks like other parts of the system still expect it to be available during later stages of proc exit.

It's not clear to me if pgstat shutdown should happen later or code that gets executed later in the cycle should not try to use pgstat. It's also entirely possible that my usage of SharedFileSet is completely wrong. If that's the case, please let me know the mistake in the usage.

Patch modifying the test case is attached. In order to reproduce the problem quickly,  I added a CHECK_FOR_INTERRUPTS() in the test, but I don't see why that would be a bad coding practice.

Thanks,
Pavan



Attachment

Re: Assertion failure on PG15 with modified test_shm_mq test

From
Michael Paquier
Date:
On Wed, Aug 17, 2022 at 11:15:28AM +0530, Pavan Deolasee wrote:
> I notice that pgstat_shutdown_hook() is registered as a before-shmem-exit
> callback. The callback is responsible for detaching from the pgstat shared
> memory segment. But looks like other parts of the system still expect it to
> be available during later stages of proc exit.
>
> It's not clear to me if pgstat shutdown should happen later or code that
> gets executed later in the cycle should not try to use pgstat. It's also
> entirely possible that my usage of SharedFileSet is completely wrong. If
> that's the case, please let me know the mistake in the usage.

That's visibly an issue with shared memory and the stats.  I have
added an open item.  Andres?
--
Michael

Attachment

Re: Assertion failure on PG15 with modified test_shm_mq test

From
Andres Freund
Date:
Hi,

On 2022-08-17 11:15:28 +0530, Pavan Deolasee wrote:
> I've a slightly modified version of test_shm_mq, that I changed to include
> a shared fileset.  The motivation to do that came because I hit an
> assertion failure with PG15 while doing some development work on BDR and I
> suspected it to be a PG15 bug.

> I notice that pgstat_shutdown_hook() is registered as a before-shmem-exit
> callback. The callback is responsible for detaching from the pgstat shared
> memory segment. But looks like other parts of the system still expect it to
> be available during later stages of proc exit.

> It's not clear to me if pgstat shutdown should happen later or code that
> gets executed later in the cycle should not try to use pgstat. It's also
> entirely possible that my usage of SharedFileSet is completely wrong. If
> that's the case, please let me know the mistake in the usage.

I don't think we have the infrastructure for a nice solution to this at the
moment - we need a fairly large overhaul of process initialization / shutdown
to handle these interdependencies nicely.

We can't move pgstat shutdown into on_dsm callback because that's too late to
allocate *new* dsm segments, which we might need to do while flushing
out pending stats.

See https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=fa91d4c91f28f4819dc54f93adbd413a685e366a
for a way to avoid the problem.

Greetings,

Andres Freund



Re: Assertion failure on PG15 with modified test_shm_mq test

From
Andres Freund
Date:
Hi,

On 2022-08-17 15:02:28 +0900, Michael Paquier wrote:
> On Wed, Aug 17, 2022 at 11:15:28AM +0530, Pavan Deolasee wrote:
> > I notice that pgstat_shutdown_hook() is registered as a before-shmem-exit
> > callback. The callback is responsible for detaching from the pgstat shared
> > memory segment. But looks like other parts of the system still expect it to
> > be available during later stages of proc exit.
> > 
> > It's not clear to me if pgstat shutdown should happen later or code that
> > gets executed later in the cycle should not try to use pgstat. It's also
> > entirely possible that my usage of SharedFileSet is completely wrong. If
> > that's the case, please let me know the mistake in the usage.
> 
> That's visibly an issue with shared memory and the stats.  I have
> added an open item.  Andres?

I don't think there's anything reasonably done about this for 15, as explained
upthread. We need a big redesign of the shutdown sequence at some point, but
...

Greetings,

Andres Freund



Re: Assertion failure on PG15 with modified test_shm_mq test

From
Pavan Deolasee
Date:
Hi,

On Thu, Aug 18, 2022 at 5:38 AM Andres Freund <andres@anarazel.de> wrote:
I don't think we have the infrastructure for a nice solution to this at the
moment - we need a fairly large overhaul of process initialization / shutdown
to handle these interdependencies nicely.


Ok, understood.
 
We can't move pgstat shutdown into on_dsm callback because that's too late to
allocate *new* dsm segments, which we might need to do while flushing
out pending stats.

See https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=fa91d4c91f28f4819dc54f93adbd413a685e366a
for a way to avoid the problem.


Thanks for the hint. I will try that approach. I wonder though if there is something more we can do. For example, would it make sense to throw a WARNING and avoid segfault if pgstat machinery is already shutdown? Just worried if the code can be reached from multiple paths and testing all of those would be difficult for extension developers, especially given this may happen in error recovery path.

Thanks,
Pavan

--
Pavan Deolasee
EnterpriseDB: https://www.enterprisedb..com

Re: Assertion failure on PG15 with modified test_shm_mq test

From
Kyotaro Horiguchi
Date:
At Thu, 18 Aug 2022 16:58:24 +0530, Pavan Deolasee <pavan.deolasee@gmail.com> wrote in 
> Hi,
> 
> On Thu, Aug 18, 2022 at 5:38 AM Andres Freund <andres@anarazel.de> wrote:
> 
> > We can't move pgstat shutdown into on_dsm callback because that's too late
> > to
> > allocate *new* dsm segments, which we might need to do while flushing
> > out pending stats.
> >
> > See
> > https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=fa91d4c91f28f4819dc54f93adbd413a685e366a
> > for a way to avoid the problem.
> >
> >
> Thanks for the hint. I will try that approach. I wonder though if there is
> something more we can do. For example, would it make sense to throw a
> WARNING and avoid segfault if pgstat machinery is already shutdown? Just
> worried if the code can be reached from multiple paths and testing all of
> those would be difficult for extension developers, especially given this
> may happen in error recovery path.

I'm not sure how extensions can face this problem, but..

pgstat is designed not to lose reported numbers. The assertion is
manifets that intention.  It is not enabled on non-assertion builds
and pgstat enters undefined state then maybe crash after the assertion
point.  On the other hand I don't think we want to perform the same
check for the all places the assertion exists on non-assertion builds.

We cannot simply replace the assertion with ereport().

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center