On Wed, Jul 23, 2025 at 7:09 AM Michael Paquier <michael@paquier.xyz> wrote:
>
> On Mon, Jul 21, 2025 at 12:51:14PM +0530, Rahila Syed wrote:
> > This appears to be a valid issue where the Autovacuum worker fails while
> > already holding an
> > LWLock on one of the pgStatLocal.shared_hash partitions. As a result, when
> > we attempt to
> > access this table again during proc_exit cleanup in dshash_find, the assert
> > is triggered. I haven’t
> > yet checked exactly where the lock is acquired within the Autovacuum
> > worker, but as Dilip mentioned,
> > reviewing where the error occurs in the Autovacuum worker would be helpful.
>
> Per dsm_attach@dsm.c, about the original FATAL message "can't attach
> the same segment more than once" that triggers the assertion
> afterwards:
> * If you're hitting this error, you probably want to attempt to find an
> * existing mapping via dsm_find_mapping() before calling dsm_attach() to
> * create a new one.
>
> One thing that we could do is to upgrade this FATAL to a PANIC, to get
> an idea of the stack where the original problem happens.
>
> The stack is referencing a backend-level stats getting dropped by an
> autovacuum worker as a result of pgstat_drop_entry() done in
> pgstat_shutdown_hook(), so it looks like we are reaching a new error
> state in v18 that could not happen before within the DSM, as an after
> effect of the FATAL causing the autovacuum worker to stop. Never seen
> this one. We're already doing stats reports in the
> pgstat_report_stat() call with manipulations of the pgstats
> dshash while shutting down.
>
> objid at 5015 means that the procnum is set as such. How many
> max_connections do you have? It seems like a high number points to a
> better reproducibility.
>
> Robins, is that your host with gcc experimental? Could it be possible
> to re-run the test with a patched build with the FATAL upgraded to
> PANIC and see what happens?
Yeah, that would be a good idea.
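
For the record, the debug change would be a one-liner in dsm_attach()
(a sketch against dsm.c as I read it; the elevel at that site may be
ERROR getting promoted to FATAL at shutdown, the point is only to
force an abort that preserves the stack):

    /* src/backend/storage/ipc/dsm.c, dsm_attach() */
    dlist_foreach(iter, &dsm_segment_list)
    {
        dsm_segment *seg = dlist_container(dsm_segment, node, iter.cur);

        if (seg->handle == h)
            elog(PANIC, /* was ERROR/FATAL; PANIC aborts, keeping the stack */
                 "can't attach the same segment more than once");
    }
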
I was looking into the vacuum code to see where we acquire this lock
and whether there is any path that could throw an error without
releasing it. IIUC, we acquire the lock while updating the vacuum
stats via pgstat_report_vacuum(): that goes through
pgstat_get_entry_ref_locked(), and if we do not find a cached entry
we take the partition lock of the shared hash for a short while,
releasing it right after incrementing the entry's refcount. So in
this particular path I don't see any possibility of throwing an
error while holding the lock.
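
To make that window explicit, the cache-miss path boils down to this
(a simplified sketch of what pgstat_get_entry_ref() does in
pgstat_shmem.c, not the exact code):

    /* cache miss: find or create the shared entry;
     * dshash_find_or_insert() returns with the partition lock of
     * pgStatLocal.shared_hash held */
    shhashent = dshash_find_or_insert(pgStatLocal.shared_hash,
                                      &key, &shfound);

    /* pin the shared entry for this backend ... */
    pg_atomic_fetch_add_u32(&shhashent->refcount, 1);

    /* ... and drop the partition lock right away */
    dshash_release_lock(pgStatLocal.shared_hash, shhashent);

Nothing between the find and the release can elog() AFAICS, so if an
error escapes with a partition lock still held, it has to come from
somewhere else.
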
--
Regards,
Dilip Kumar
Google