Thread: Fix possible overflow of pg_stat DSA's refcnt

Fix possible overflow of pg_stat DSA's refcnt

From
Anthonin Bonnefoy
Date:
Hi,

During backend initialisation, pgStat DSA is attached using dsa_attach_in_place with a NULL segment. The NULL segment means that there's no callback to release the DSA when the process exits. pgstat_detach_shmem only calls dsa_detach which, as mentioned in the function's comment, doesn't include releasing and doesn't decrement the reference count of pgStat DSA.

Thus, every time a backend is created, pgStat DSA's refcnt is incremented but never decremented when the backend shuts down. It will eventually overflow and wrap around to 0, triggering the "could not attach to dynamic shared area" error on all newly created backends. Once this state is reached, the only way to recover is to restart the db to reset the counter.

The issue can be observed by calling dsa_dump in pgstat_detach_shmem and checking that refcnt's value keeps increasing as new backends are created. It is also possible to reach the state where all connections are refused by editing refcnt manually with lldb/gdb (the alternative, creating enough backends for the counter to reach 0, works but can take some time). Setting it to -10 and then opening 10 connections will eventually generate the "could not attach" error.
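The failure mode described above can be illustrated with a small standalone model. This is only a sketch, not PostgreSQL code: toy_area, toy_attach, toy_detach_without_release and toy_run_backends are hypothetical names, and it assumes refcnt behaves like a signed 32-bit counter that attach refuses to touch once it is 0.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the shared DSA control data. */
typedef struct
{
    int32_t refcnt;     /* assumption: signed 32-bit reference count */
} toy_area;

/* Models attach: refuse when refcnt is 0, otherwise take a reference. */
static bool
toy_attach(toy_area *area)
{
    if (area->refcnt == 0)
        return false;   /* "could not attach to dynamic shared area" */

    /*
     * Increment through unsigned arithmetic so that the wraparound past
     * INT32_MAX is not undefined behaviour in this model (the conversion
     * back to int32_t wraps on mainstream two's complement compilers).
     */
    area->refcnt = (int32_t) ((uint32_t) area->refcnt + 1u);
    return true;
}

/* Models a detach that never gives the reference back. */
static void
toy_detach_without_release(toy_area *area)
{
    (void) area;        /* refcnt is deliberately left untouched */
}

/* Runs up to n backend lifecycles; returns how many managed to attach. */
static int
toy_run_backends(toy_area *area, int n)
{
    int attached = 0;

    for (int i = 0; i < n; i++)
    {
        if (!toy_attach(area))
            break;
        toy_detach_without_release(area);
        attached++;
    }
    return attached;
}
```

Starting this model at refcnt = -10 and calling toy_run_backends(&area, 20) lets 10 backends attach, after which refcnt is 0 and every further attach is refused, which mirrors the debugger experiment described above.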

This patch fixes this issue by releasing pgStat DSA with dsa_release_in_place during pgStat shutdown to correctly decrement the refcnt.
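The idea behind the fix can be sketched with a standalone model in which every attach is paired with a release at shutdown. This is not the patch itself: sketch_area, sketch_attach, sketch_release and sketch_backend_lifecycle are hypothetical names, and the model only assumes that the release step decrements the same counter that attach increments.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the shared DSA control data. */
typedef struct
{
    int refcnt;
} sketch_area;

/* Models attach: refuse when refcnt is 0, otherwise take a reference. */
static bool
sketch_attach(sketch_area *area)
{
    if (area->refcnt == 0)
        return false;
    area->refcnt++;
    return true;
}

/* Models the release step done at shutdown: give the reference back. */
static void
sketch_release(sketch_area *area)
{
    area->refcnt--;
}

/* One backend lifecycle with a balanced attach and release. */
static bool
sketch_backend_lifecycle(sketch_area *area)
{
    if (!sketch_attach(area))
        return false;

    /* ... the backend would do its work here ... */

    sketch_release(area);
    return true;
}
```

With the release in place, the counter returns to its starting value after each lifecycle, so the number of backends created over the server's uptime no longer matters.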

Regards,
Anthonin
Attachment

Re: Fix possible overflow of pg_stat DSA's refcnt

From
Michael Paquier
Date:
On Tue, Jun 25, 2024 at 05:01:55PM +0200, Anthonin Bonnefoy wrote:
> During backend initialisation, pgStat DSA is attached using
> dsa_attach_in_place with a NULL segment. The NULL segment means that
> there's no callback to release the DSA when the process exits.
> pgstat_detach_shmem only calls dsa_detach which, as mentioned in the
> function's comment, doesn't include releasing and doesn't decrement the
> reference count of pgStat DSA.
>
> Thus, every time a backend is created, pgStat DSA's refcnt is incremented
> but never decremented when the backend shuts down. It will eventually
> overflow and reach 0, triggering the "could not attach to dynamic shared
> area" error on all newly created backends. When this state is reached, the
> only way to recover is to restart the db to reset the counter.

Very good catch!  It looks like you have seen that in the field, then.
Sad face.

> This patch fixes this issue by releasing pgStat DSA with
> dsa_release_in_place during pgStat shutdown to correctly decrement the
> refcnt.

Sounds logical to me to do that in the pgstat shutdown callback, ordered
with the dsa_detach calls in a single location rather than registering
a different callback to do the same job.  Will fix and backpatch,
thanks for the report!
--
Michael

Re: Fix possible overflow of pg_stat DSA's refcnt

From
Anthonin Bonnefoy
Date:
On Wed, Jun 26, 2024 at 7:40 AM Michael Paquier <michael@paquier.xyz> wrote:
>
> Very good catch!  It looks like you have seen that in the field, then.
> Sad face.

Yeah, this happened last week on one of our replicas (version 15.5)
that had 134 days of uptime. We are running a lot of parallel
queries on this cluster, so the combination of high uptime and
parallel worker creation eventually triggered the issue.
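As a rough back-of-the-envelope check, and assuming refcnt is a signed 32-bit counter that starts at a small positive value, it needs on the order of 2^32 increments to climb past INT_MAX, wrap negative and come back up to 0. The hypothetical helper below (not from the patch) computes the average attach rate that would imply for a given uptime; the real rate on the cluster is not stated in this thread.

```c
/*
 * Average attaches per second needed for a 32-bit counter to go all the
 * way around (about 2^32 increments) within the given uptime in days.
 */
static double
attaches_per_second_to_wrap(double uptime_days)
{
    const double wrap_count = 4294967296.0;     /* 2^32 */
    const double seconds = uptime_days * 86400.0;

    return wrap_count / seconds;
}
```

For 134 days this works out to about 371 process starts per second on average, counting both regular backends and parallel workers.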

> Will fix and backpatch, thanks for the report!

Thanks for handling this and for the quick answer!

Regards,
Anthonin



Re: Fix possible overflow of pg_stat DSA's refcnt

From
Michael Paquier
Date:
On Wed, Jun 26, 2024 at 08:48:06AM +0200, Anthonin Bonnefoy wrote:
> Yeah, this happened last week on one of our replicas (version 15.5)
> that had 134 days of uptime. We are running a lot of parallel
> queries on this cluster, so the combination of high uptime and
> parallel worker creation eventually triggered the issue.

It is not surprising that it took this long to be detected.
I've applied the patch down to 15.  Thanks a lot
for the analysis and the patch!
--
Michael