Thread: elog.c query_id support vs shutdown

elog.c query_id support vs shutdown

From
Andres Freund
Date:
Hi,

In the course of [1] I again looked at all pgstat_* callsites and noticed the
new calls to pgstat_get_my_query_id() in elog.c.

As currently implemented those pgstat_get_my_query_id() calls are not
safe. It's fine during backend startup because MyBEEntry is not set, but
during shutdown that's not ok, because we never unset MyBEEntry.

andres@awork3:~/src/postgresql$ /home/andres/build/postgres/dev-assert/vpath/src/backend/postgres --single postgres -D
/srv/dev/pgdev-dev/-c 'log_line_prefix=%Q' -c log_min_messages=debug1
 
0NOTICE:  database system was shut down at 2021-08-07 16:42:26 PDT
0DEBUG:  checkpoint record is at 1/1C000508
0DEBUG:  redo record is at 1/1C000508; shutdown true
0DEBUG:  next transaction ID: 213067; next OID: 16434
0DEBUG:  next MultiXactId: 1; next MultiXactOffset: 0
0DEBUG:  oldest unfrozen transaction ID: 701, in database 1
0DEBUG:  oldest MultiXactId: 1, in database 1
0DEBUG:  commit timestamp Xid oldest/newest: 0/0
0DEBUG:  transaction ID wrap limit is 2147484348, limited by database with OID 1
0DEBUG:  MultiXactId wrap limit is 2147483648, limited by database with OID 1
0DEBUG:  starting up replication slots
0DEBUG:  MultiXactId wrap limit is 2147483648, limited by database with OID 1
0DEBUG:  MultiXact member stop limit is now 4294914944 based on MultiXact 1

PostgreSQL stand-alone backend 15devel
backend> 0NOTICE:  shutting down
0DEBUG:  performing replication slot checkpoint
Segmentation fault

#0  0x000055cb9b871632 in pgstat_get_my_query_id () at
/home/andres/src/postgresql/src/backend/utils/activity/backend_status.c:1042
#1  0x000055cb9ba00398 in log_line_prefix (buf=0x7ffc751092e0, edata=0x55cb9bd6a900 <errordata>)
    at /home/andres/src/postgresql/src/backend/utils/error/elog.c:2723
#2  0x000055cb9ba00ee6 in send_message_to_server_log (edata=0x55cb9bd6a900 <errordata>) at
/home/andres/src/postgresql/src/backend/utils/error/elog.c:3017
#3  0x000055cb9b9fe1b3 in EmitErrorReport () at /home/andres/src/postgresql/src/backend/utils/error/elog.c:1542
#4  0x000055cb9b9fb600 in errfinish (filename=0x55cb9bc1a0eb "miscinit.c", lineno=974, funcname=0x55cb9bc1aea0
<__func__.5>"UnlinkLockFiles")
 
    at /home/andres/src/postgresql/src/backend/utils/error/elog.c:597
#5  0x000055cb9ba0f2b4 in UnlinkLockFiles (status=0, arg=0) at
/home/andres/src/postgresql/src/backend/utils/init/miscinit.c:974
#6  0x000055cb9b8148da in proc_exit_prepare (code=0) at /home/andres/src/postgresql/src/backend/storage/ipc/ipc.c:209
#7  0x000055cb9b814792 in proc_exit (code=0) at /home/andres/src/postgresql/src/backend/storage/ipc/ipc.c:107
#8  0x000055cb9b851568 in PostgresMain (argc=9, argv=0x55cb9daddec0, dbname=0x55cb9dac4f70 "postgres",
username=0x55cb9dac3580"andres")
 
    at /home/andres/src/postgresql/src/backend/tcop/postgres.c:4702
#9  0x000055cb9b672c5b in main (argc=9, argv=0x55cb9daddec0) at
/home/andres/src/postgresql/src/backend/main/main.c:195


I suspect that to make the elog.c usage safe, we'll have to clear MyBEEntry in
pgstat_beshutdown_hook().

Greetings,

Andres Freund

[1] https://www.postgresql.org/message-id/20210807210349.bby5ta2xrbnte6ht%40alap3.anarazel.de



Re: elog.c query_id support vs shutdown

From
Julien Rouhaud
Date:
On Sat, Aug 07, 2021 at 04:44:07PM -0700, Andres Freund wrote:
> 
> As currently implemented those pgstat_get_my_query_id() calls are not
> safe. It's fine during backend startup because MyBEEntry is not set, but
> during shutdown that's not ok, because we never unset MyBEEntry.
> 
> andres@awork3:~/src/postgresql$ /home/andres/build/postgres/dev-assert/vpath/src/backend/postgres --single postgres
-D/srv/dev/pgdev-dev/ -c 'log_line_prefix=%Q' -c log_min_messages=debug1
 
> [...]
> PostgreSQL stand-alone backend 15devel
> backend> 0NOTICE:  shutting down
> 0DEBUG:  performing replication slot checkpoint
> Segmentation fault

Ouch

> I suspect that to make the elog.c usage safe, we'll have to clear MyBEEntry in
> pgstat_beshutdown_hook().

I agree, and a quick test indeed fix your scenario.  It also seems like a good
thing to do overall.

I didn't find any other problematic corner cases, but I'm not that familiar with
pgstat, especially after the recent activity.



Re: elog.c query_id support vs shutdown

From
Julien Rouhaud
Date:
On Sun, Aug 08, 2021 at 01:46:39PM +0800, Julien Rouhaud wrote:
> On Sat, Aug 07, 2021 at 04:44:07PM -0700, Andres Freund wrote:
> > 
> > As currently implemented those pgstat_get_my_query_id() calls are not
> > safe. It's fine during backend startup because MyBEEntry is not set, but
> > during shutdown that's not ok, because we never unset MyBEEntry.
> > 
> > andres@awork3:~/src/postgresql$ /home/andres/build/postgres/dev-assert/vpath/src/backend/postgres --single postgres
-D/srv/dev/pgdev-dev/ -c 'log_line_prefix=%Q' -c log_min_messages=debug1
 
> > [...]
> > PostgreSQL stand-alone backend 15devel
> > backend> 0NOTICE:  shutting down
> > 0DEBUG:  performing replication slot checkpoint
> > Segmentation fault
> 
FTR I just added an open item for that.



Re: elog.c query_id support vs shutdown

From
Andres Freund
Date:
Hi,

On 2021-08-08 13:46:39 +0800, Julien Rouhaud wrote:
> On Sat, Aug 07, 2021 at 04:44:07PM -0700, Andres Freund wrote:
> > 
> > As currently implemented those pgstat_get_my_query_id() calls are not
> > safe. It's fine during backend startup because MyBEEntry is not set, but
> > during shutdown that's not ok, because we never unset MyBEEntry.
> > 
> > andres@awork3:~/src/postgresql$ /home/andres/build/postgres/dev-assert/vpath/src/backend/postgres --single postgres
-D/srv/dev/pgdev-dev/ -c 'log_line_prefix=%Q' -c log_min_messages=debug1
 
> > [...]
> > PostgreSQL stand-alone backend 15devel
> > backend> 0NOTICE:  shutting down
> > 0DEBUG:  performing replication slot checkpoint
> > Segmentation fault
> 
> Ouch
> 
> > I suspect that to make the elog.c usage safe, we'll have to clear MyBEEntry in
> > pgstat_beshutdown_hook().
> 
> I agree, and a quick test indeed fix your scenario.  It also seems like a good
> thing to do overall.

Yea, it does seem like a good thing. But we should do a search for the
problems it could cause...


> I didn't find any other problematic corner cases, but I'm not that familiar
> with pgstat, especially after the recent activity.

I don't think anything relevant to this issue has changed so far... And there
shouldn't be a meaningful amount of change to backend_status.c anyway - the
"what is currently happening" stuff that backend_status.c implements is mostly
independent from the "what has happened so far" that pgstats.c implements.

It probably would be a good idea to separate out the two namespaces more
clearly. Even with things like pgstat_report_activity() not being entirely
clear cut (because of the pgstat_count_conn_* calls) it still seems like it
would be an improvement. But I don't want to do that before the shared memory
stuff is in.

Greetings,

Andres Freund



Re: elog.c query_id support vs shutdown

From
Andres Freund
Date:
Hi,

On 2021-08-08 11:53:39 -0700, Andres Freund wrote:
> On 2021-08-08 13:46:39 +0800, Julien Rouhaud wrote:
> > > I suspect that to make the elog.c usage safe, we'll have to clear MyBEEntry in
> > > pgstat_beshutdown_hook().
> > 
> > I agree, and a quick test indeed fix your scenario.  It also seems like a good
> > thing to do overall.
> 
> Yea, it does seem like a good thing. But we should do a search for the
> problems it could cause...

Not a problem with unsetting MyBEEntry. But the search for problems made me
reread the following comment:

    /*
     * There's no need for a lock around pgstat_begin_read_activity /
     * pgstat_end_read_activity here as it's only called from
     * pg_stat_get_activity which is already protected, or from the same
     * backend which means that there won't be concurrent writes.
     */

I don't understand the pg_stat_get_activity() part of this comment?
pgstat_get_my_query_id() hardcodes MyBEEntry->st_query_id, so it can't be
useful to pg_stat_get_activity(), nor is it used there?

I assume it's just a remnant from an earlier iteration of the code?

Greetings,

Andres Freund



Re: elog.c query_id support vs shutdown

From
Julien Rouhaud
Date:
On Mon, Aug 9, 2021 at 3:06 AM Andres Freund <andres@anarazel.de> wrote:
>
> Hi,
>
> On 2021-08-08 11:53:39 -0700, Andres Freund wrote:
> > On 2021-08-08 13:46:39 +0800, Julien Rouhaud wrote:
> > > > I suspect that to make the elog.c usage safe, we'll have to clear MyBEEntry in
> > > > pgstat_beshutdown_hook().
> > >
> > > I agree, and a quick test indeed fix your scenario.  It also seems like a good
> > > thing to do overall.
> >
> > Yea, it does seem like a good thing. But we should do a search for the
> > problems it could cause...

Agreed, and I'm also looking into it.

> Not a problem with unsetting MyBEEntry. But the search for problems made me
> reread the following comment:
>
>         /*
>          * There's no need for a lock around pgstat_begin_read_activity /
>          * pgstat_end_read_activity here as it's only called from
>          * pg_stat_get_activity which is already protected, or from the same
>          * backend which means that there won't be concurrent writes.
>          */
>
> I don't understand the pg_stat_get_activity() part of this comment?
> pgstat_get_my_query_id() hardcodes MyBEEntry->st_query_id, so it can't be
> useful to pg_stat_get_activity(), nor is it used there?
>
> I assume it's just a remnant from an earlier iteration of the code?

Ah indeed!  This was quite a long thread so I didn't try to see when
that changed.  I also now realize that I made a typo in the patch
where I s/loop/look/ which was then changed to s/look/lock/.  The
comment should be something like:

/*
* There's no need for a loop around pgstat_begin_read_activity /
* pgstat_end_read_activity here as it's only called from the same backend
* which means that there won't be concurrent writes.
*/



Re: elog.c query_id support vs shutdown

From
Michael Paquier
Date:
On Mon, Aug 09, 2021 at 11:29:24AM +0800, Julien Rouhaud wrote:
> Ah indeed!  This was quite a long thread so I didn't try to see when
> that changed.  I also now realize that I made a typo in the patch
> where I s/loop/look/ which was then changed to s/look/lock/.  The
> comment should be something like:

The thread of this open item is now idle for 10 days and there does
not seem to be a lot of progress.  Bruce, this is assigned to you.
Are you planning to look at it?
--
Michael

Attachment

Re: elog.c query_id support vs shutdown

From
Julien Rouhaud
Date:
On Thu, Aug 19, 2021 at 3:05 PM Michael Paquier <michael@paquier.xyz> wrote:
>
> The thread of this open item is now idle for 10 days and there does
> not seem to be a lot of progress.  Bruce, this is assigned to you.
> Are you planning to look at it?

I'm really sorry for the lack of answer on my side, I had too many
duties at work to thoroughly look at the implication of clearing
MyBEEntry in the pgstat_beshutdown_hook :(

With what I've seen so far I didn't find any evidence that it could
lead to any new problem (it should actually make things slightly
safer) and couldn't hit any issue with some testing.  Opinion from
someone more familiar with pgstats is of course welcome.

While reading the various code, I also found this incorrect comment in
backend_status.c:

/* exposed so that progress.c can access it */
PgBackendStatus *MyBEEntry = NULL;

progress.c was apparently renamed to backend_progress.c sometime
during e1025044 development (Split backend status and progress related
functionality out of pgstat.c.).

So I'm +1 for the fix originally suggested by Andres.



Re: elog.c query_id support vs shutdown

From
Michael Paquier
Date:
On Thu, Aug 19, 2021 at 03:29:40PM +0800, Julien Rouhaud wrote:
> On Thu, Aug 19, 2021 at 3:05 PM Michael Paquier <michael@paquier.xyz> wrote:
> >
> > The thread of this open item is now idle for 10 days and there does
> > not seem to be a lot of progress.  Bruce, this is assigned to you.
> > Are you planning to look at it?
>
> I'm really sorry for the lack of answer on my side, I had too many
> duties at work to thoroughly look at the implication of clearing
> MyBEEntry in the pgstat_beshutdown_hook :(
>
> With what I've seen so far I didn't find any evidence that it could
> lead to any new problem (it should actually make things slightly
> safer) and couldn't hit any issue with some testing.  Opinion from
> someone more familiar with pgstats is of course welcome.

Okay.  Would you like to write a patch then?
--
Michael

Attachment

Re: elog.c query_id support vs shutdown

From
Andres Freund
Date:
On 2021-08-19 17:05:01 +0900, Michael Paquier wrote:
> Okay.  Would you like to write a patch then?

I'll push something in a bit...

- Andres



Re: elog.c query_id support vs shutdown

From
Andres Freund
Date:
On 2021-08-19 01:10:55 -0700, Andres Freund wrote:
> On 2021-08-19 17:05:01 +0900, Michael Paquier wrote:
> > Okay.  Would you like to write a patch then?
> 
> I'll push something in a bit...

Done.



Re: elog.c query_id support vs shutdown

From
Julien Rouhaud
Date:
On Thu, Aug 19, 2021 at 8:20 PM Andres Freund <andres@anarazel.de> wrote:
>
> On 2021-08-19 01:10:55 -0700, Andres Freund wrote:
> > On 2021-08-19 17:05:01 +0900, Michael Paquier wrote:
> > > Okay.  Would you like to write a patch then?
> >
> > I'll push something in a bit...
>
> Done.

Thanks!



Re: elog.c query_id support vs shutdown

From
Bruce Momjian
Date:
On Thu, Aug 19, 2021 at 05:20:52AM -0700, Andres Freund wrote:
> On 2021-08-19 01:10:55 -0700, Andres Freund wrote:
> > On 2021-08-19 17:05:01 +0900, Michael Paquier wrote:
> > > Okay.  Would you like to write a patch then?
> > 
> > I'll push something in a bit...
> 
> Done.

Thanks.  I was kind of wondering since there had not been a patch posted
yet.

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com

  If only the physical world exists, free will is an illusion.




Re: elog.c query_id support vs shutdown

From
Michael Paquier
Date:
On Thu, Aug 19, 2021 at 05:20:52AM -0700, Andres Freund wrote:
> Done.

Thanks.  There does not seem to be anything left here, so I have
marked this open item as fixed.
--
Michael

Attachment