Thread: BUG #17804: Assertion failed in pg_stat after fetching from pg_stat_database and swithing cache->snapshot
BUG #17804: Assertion failed in pg_stat after fetching from pg_stat_database and swithing cache->snapshot
From
PG Bug reporting form
Date:
The following bug has been logged on the website: Bug reference: 17804 Logged by: Alexander Lakhin Email address: exclusion@gmail.com PostgreSQL version: 15.2 Operating system: Ubuntu 22.04 Description: The following transaction: BEGIN; SET LOCAL stats_fetch_consistency = cache; SELECT * FROM pg_stat_database; SET LOCAL stats_fetch_consistency = snapshot; SELECT pg_stat_get_function_calls(0); raises an assertion failure with the following stack: Core was generated by `postgres: law regression [local] SELECT '. Program terminated with signal SIGABRT, Aborted. warning: Section `.reg-xstate/1355782' in core file too small. #0 __pthread_kill_implementation (no_tid=0, signo=6, threadid=140663482824512) at ./nptl/pthread_kill.c:44 44 ./nptl/pthread_kill.c: No such file or directory. (gdb) bt #0 __pthread_kill_implementation (no_tid=0, signo=6, threadid=140663482824512) at ./nptl/pthread_kill.c:44 #1 __pthread_kill_internal (signo=6, threadid=140663482824512) at ./nptl/pthread_kill.c:78 #2 __GI___pthread_kill (threadid=140663482824512, signo=signo@entry=6) at ./nptl/pthread_kill.c:89 #3 0x00007feec4a1a476 in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26 #4 0x00007feec4a007f3 in __GI_abort () at ./stdlib/abort.c:79 #5 0x0000555cf0603315 in ExceptionalCondition ( conditionName=0x555cf07ecf38 "pgStatLocal.snapshot.stats->members == 0", errorType=0x555cf07ec974 "FailedAssertion", fileName=0x555cf07ecc8f "pgstat.c", lineNumber=948) at assert.c:69 #6 0x0000555cf045e815 in pgstat_build_snapshot () at pgstat.c:948 #7 0x0000555cf045e39b in pgstat_fetch_entry (kind=PGSTAT_KIND_FUNCTION, dboid=16384, objoid=0) at pgstat.c:798 #8 0x0000555cf0461fae in pgstat_fetch_stat_funcentry (func_id=0) at pgstat_function.c:242 #9 0x0000555cf0544c66 in pg_stat_get_function_calls (fcinfo=0x555cf164cb68) at pgstatfuncs.c:384 #10 0x0000555cf01b7396 in ExecInterpExpr (state=0x555cf164ca80, econtext=0x555cf164c780, isnull=0x7ffe50dc040f) at execExprInterp.c:751 #11 0x0000555cf01b964e in ExecInterpExprStillValid (state=0x555cf164ca80, econtext=0x555cf164c780, isNull=0x7ffe50dc040f) at execExprInterp.c:1826 #12 0x0000555cf0217315 in ExecEvalExprSwitchContext (state=0x555cf164ca80, econtext=0x555cf164c780, isNull=0x7ffe50dc040f) at ../../../src/include/executor/executor.h:341 #13 0x0000555cf021738d in ExecProject (projInfo=0x555cf164ca78) at ../../../src/include/executor/executor.h:375 #14 0x0000555cf02175d7 in ExecResult (pstate=0x555cf164c668) at nodeResult.c:136 #15 0x0000555cf01d022d in ExecProcNodeFirst (node=0x555cf164c668) at execProcnode.c:464 #16 0x0000555cf01c34ee in ExecProcNode (node=0x555cf164c668) at ../../../src/include/executor/executor.h:259 #17 0x0000555cf01c63f5 in ExecutePlan (estate=0x555cf164c430, planstate=0x555cf164c668, use_parallel_mode=false, operation=CMD_SELECT, sendTuples=true, numberTuples=0, direction=ForwardScanDirection, dest=0x555cf16519a0, execute_once=true) at execMain.c:1636 #18 0x0000555cf01c3c25 in standard_ExecutorRun (queryDesc=0x555cf165bba0, direction=ForwardScanDirection, count=0, execute_once=true) at execMain.c:363 #19 0x0000555cf01c3a10 in ExecutorRun (queryDesc=0x555cf165bba0, direction=ForwardScanDirection, count=0, execute_once=true) at execMain.c:307 #20 0x0000555cf043cae6 in PortalRunSelect (portal=0x555cf15eeb90, forward=true, count=0, dest=0x555cf16519a0) at pquery.c:924 #21 0x0000555cf043c704 in PortalRun (portal=0x555cf15eeb90, count=9223372036854775807, isTopLevel=true, run_once=true, dest=0x555cf16519a0, altdest=0x555cf16519a0, qc=0x7ffe50dc0810) at pquery.c:768 #22 0x0000555cf0435590 in exec_simple_query ( query_string=0x555cf1582570 "SELECT pg_stat_get_function_calls(0);") at postgres.c:1250 #23 0x0000555cf043a4af in PostgresMain (dbname=0x555cf15aea18 "regression", username=0x555cf15ae9f8 "law") at postgres.c:4593 #24 0x0000555cf035fd2b in BackendRun (port=0x555cf15a6130) at postmaster.c:4511 #25 0x0000555cf035f5b2 in BackendStartup (port=0x555cf15a6130) at postmaster.c:4239 #26 0x0000555cf035b53f in ServerLoop () at postmaster.c:1806 #27 0x0000555cf035ac9c in PostmasterMain (argc=3, argv=0x555cf157da30) at postmaster.c:1478 #28 0x0000555cf024ebdd in main (argc=3, argv=0x555cf157da30) at main.c:202 It appeared with the stats_fetch_consistency introduction (5891c7a8e). Looks like pgstat_fetch_entry() can fill up the snapshot.stats structure and it's contents preserved when switching the consistency mode, what pgstat_build_snapshot() doesn't expect.
Re: BUG #17804: Assertion failed in pg_stat after fetching from pg_stat_database and swithing cache->snapshot
From
Kyotaro Horiguchi
Date:
At Wed, 22 Feb 2023 06:00:01 +0000, PG Bug reporting form <noreply@postgresql.org> wrote in > The following transaction: > BEGIN; > SET LOCAL stats_fetch_consistency = cache; > SELECT * FROM pg_stat_database; > SET LOCAL stats_fetch_consistency = snapshot; > SELECT pg_stat_get_function_calls(0); > > raises an assertion failure with the following stack: > Core was generated by `postgres: law regression [local] SELECT > '. > Program terminated with signal SIGABRT, Aborted. Thanks for the report! We didn't consider the case that variable is changed in-transaction. Perhaps we should reject changes to fetch consistency once we the snapshot hash is created. I can think of two ways to do this. One approach is to perform that check at GUC-level. However, this would result in an ERROR meassage that isn't very informative. Addition to that, it seems like aborting the transaction would be too severe a punishment for this operation. ERROR: invalid value for parameter "stats_fetch_consistency": "cache" DETAIL: pgstat_fetch_consistency cannot be changed in this context. ERROR: current transaction is aborted, commands ignored until end of transactio The attached is an alternative solution. If pgstat_fetch_consistency is different from the existing snapshot, pgstat_fetch_entry() will issue a warning then continue with the old snapshot mode instead. To make this work, I modified pgstat_prep_snapshot to determine the actual consistency mode to use, and then the subsequent code uses the returned mode. WARNING: ignored in-transaction change of statistics consistency The patch allows the transitions only none->cahce and none->snapshot. To dislable all types of transitions, it seems that we need either an additional boolean in the struct PgStat_LocalState or an additional enumerator in the enyum PgStat_FetchConsistency to keep track of whether pgstat_prep_snapshot() is called during the current transaction. What are your thoughts on this? regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Attachment
Re: BUG #17804: Assertion failed in pg_stat after fetching from pg_stat_database and switching cache->snapshot
From
Alexander Lakhin
Date:
Hello Kyotaro-san, 24.02.2023 10:08, Kyotaro Horiguchi wrote: > Thanks for the report! We didn't consider the case that variable is > changed in-transaction. > > > What are your thoughts on this? What if we implement assign_pgstat_fetch_consistency() and execute pgstat_clear_snapshot() in it when the mode switch happens (to avoid possible inconsistency between the GUC and the current pgstat state)? If snapshots in different modes contain different data sets, will such behaviour be unexpected for a user? For example, if a user performs: SET LOCAL stats_fetch_consistency = snapshot; {some-actions-1} SET LOCAL stats_fetch_consistency = none; {some-actions-2} SET LOCAL stats_fetch_consistency = snapshot; {some-actions-3} Should she expect that some-actions-1, some-actions-2 will be reflected in the snapshot at the end? Best regards, Alexander
Re: BUG #17804: Assertion failed in pg_stat after fetching from pg_stat_database and switching cache->snapshot
From
Alexander Lakhin
Date:
Hello, 24.02.2023 19:00, Alexander Lakhin wrote: > Hello Kyotaro-san, > 24.02.2023 10:08, Kyotaro Horiguchi wrote: >> Thanks for the report! We didn't consider the case that variable is >> changed in-transaction. >> >> >> What are your thoughts on this? > What if we implement assign_pgstat_fetch_consistency() and execute > pgstat_clear_snapshot() in it when the mode switch happens (to avoid > possible inconsistency between the GUC and the current pgstat state)? Please look at the attached patch, which implements that idea and also includes a simple test. The behavior change doesn't affect any of existing tests and seems natural to me, so I haven't added a note about it in the docs. Best regards, Alexander
Attachment
Re: BUG #17804: Assertion failed in pg_stat after fetching from pg_stat_database and switching cache->snapshot
From
Kyotaro Horiguchi
Date:
At Thu, 27 Apr 2023 13:00:00 +0300, Alexander Lakhin <exclusion@gmail.com> wrote in > Hello, > > 24.02.2023 19:00, Alexander Lakhin wrote: > > Hello Kyotaro-san, > > 24.02.2023 10:08, Kyotaro Horiguchi wrote: > >> Thanks for the report! We didn't consider the case that variable is > >> changed in-transaction. > >> > >> > >> What are your thoughts on this? > > What if we implement assign_pgstat_fetch_consistency() and execute > > pgstat_clear_snapshot() in it when the mode switch happens (to avoid > > possible inconsistency between the GUC and the current pgstat state)? Thank you for the patch. (Sorry, I forgot about this..) > Please look at the attached patch, which implements that idea and also > includes a simple test. The behavior change doesn't affect any of > existing > tests and seems natural to me, so I haven't added a note about it in > the docs. I tried to keep existing snapshots as long as possible, but considering how rarely users change the variable and the lack of necessity to to preserve the snapshots, it's simply easier to clear them with each variable change. Nonetheless, I'm a bit concerned about making a direct call to pgstat_clear_snapshot() from the assign callback, it might be fine for now, but I worry that it could an issue later on. So, how about just settin a trigger that causes a snapshot clearing prior to the next use, like the attached? As for the test, I can't come up with a better one, but I think the comment should explain its intention more clearly. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Attachment
Re: BUG #17804: Assertion failed in pg_stat after fetching from pg_stat_database and switching cache->snapshot
From
Alexander Lakhin
Date:
Hello Kyotaro-san, 28.04.2023 04:41, Kyotaro Horiguchi wrote: > Nonetheless, I'm a bit concerned about making a direct call to > pgstat_clear_snapshot() from the assign callback, it might be fine for > now, but I worry that it could an issue later on. > > So, how about just settin a trigger that causes a snapshot clearing > prior to the next use, like the attached? I'm agree with the change — it's still rather simple and fixes the issue. Maybe you would also find appropriate to update the comment for pgstat_get_stat_snapshot_timestamp(). Perhaps add something like?: Note that the snapshot may be cleared here, if requested. > As for the test, I can't come up with a better one, but I think the > comment should explain its intention more clearly. I'd hesitated to mention a crash in the comment because I've seen assertion failures only (and no crash with a non-assert build), so I thought that the test should illustrate the new behavior in the first place. But I'm not opposed to your comment change, maybe it's more prominent indeed. Thank you for the fix! Best regards, Alexander
Re: BUG #17804: Assertion failed in pg_stat after fetching from pg_stat_database and switching cache->snapshot
From
Michael Paquier
Date:
On Fri, Apr 28, 2023 at 07:00:01AM +0300, Alexander Lakhin wrote: > 28.04.2023 04:41, Kyotaro Horiguchi wrote: >> Nonetheless, I'm a bit concerned about making a direct call to >> pgstat_clear_snapshot() from the assign callback, it might be fine for >> now, but I worry that it could an issue later on. >> >> So, how about just settin a trigger that causes a snapshot clearing >> prior to the next use, like the attached? > > I'm agree with the change — it's still rather simple and fixes the issue. > Maybe you would also find appropriate to update the comment for > pgstat_get_stat_snapshot_timestamp(). > Perhaps add something like?: > Note that the snapshot may be cleared here, if requested. FWIW, I was looking at this patch and this proposal of relying on a static flag to conditionally clear a snapshot looks rather brittle by design to me because this relies on quite a few assumptions that the snapshot will always be cleared when necessary, as proved by the two code paths of pgstat.c patched where each gain a check on clear_existing_snapshot. The AtEOXacts for pgstat when doing an abort/commit/prepare would guarantee that the flag is cleared, still that's not really exciting. What is the problem in forcing a snapshot cleanup each time stats_fetch_consistency is switched to a new value? Somebody using SET LOCAL on that means, at least it sounds to me as such, that they don't really care about the current snapshot contents so they'd better be wiped out. And the cleanup timing does not really seem to matter much. >> As for the test, I can't come up with a better one, but I think the >> comment should explain its intention more clearly. > > I'd hesitated to mention a crash in the comment because I've seen > assertion failures only (and no crash with a non-assert build), so I > thought that the test should illustrate the new behavior in the first place. > But I'm not opposed to your comment change, maybe it's more prominent indeed. If the issue is fixed, there would be no crash. -- Michael
Attachment
Re: BUG #17804: Assertion failed in pg_stat after fetching from pg_stat_database and switching cache->snapshot
From
Kyotaro Horiguchi
Date:
At Fri, 28 Apr 2023 13:37:23 +0900, Michael Paquier <michael@paquier.xyz> wrote in > On Fri, Apr 28, 2023 at 07:00:01AM +0300, Alexander Lakhin wrote: > > 28.04.2023 04:41, Kyotaro Horiguchi wrote: > >> Nonetheless, I'm a bit concerned about making a direct call to > >> pgstat_clear_snapshot() from the assign callback, it might be fine for > >> now, but I worry that it could an issue later on. > >> > >> So, how about just settin a trigger that causes a snapshot clearing > >> prior to the next use, like the attached? > > > > I'm agree with the change ― it's still rather simple and fixes the issue. > > Maybe you would also find appropriate to update the comment for > > pgstat_get_stat_snapshot_timestamp(). > > Perhaps add something like?: > > Note that the snapshot may be cleared here, if requested. > > FWIW, I was looking at this patch and this proposal of relying on a > static flag to conditionally clear a snapshot looks rather brittle by > design to me because this relies on quite a few assumptions that the > snapshot will always be cleared when necessary, as proved by the two > code paths of pgstat.c patched where each gain a check on pgstat_get_stat_snapshot_timestmp() necessarily requires that. Just for seeming consistency. The only significant part is gstat_prep_snapshot, which is always called when the caller wants a snapshot. > clear_existing_snapshot. The AtEOXacts for pgstat when doing an > abort/commit/prepare would guarantee that the flag is cleared, still > that's not really exciting. What is the problem in forcing a snapshot > cleanup each time stats_fetch_consistency is switched to a new value? Just I wanted not to do that much in the guc callback including memory context operations. If it is compeltly safe, I don't object just clearing snapshots in the callback. > Somebody using SET LOCAL on that means, at least it sounds to me as > such, that they don't really care about the current snapshot contents > so they'd better be wiped out. And the cleanup timing does not really > seem to matter much. > > >> As for the test, I can't come up with a better one, but I think the > >> comment should explain its intention more clearly. > > > > I'd hesitated to mention a crash in the comment because I've seen > > assertion failures only (and no crash with a non-assert build), so I > > thought that the test should illustrate the new behavior in the first place. > > But I'm not opposed to your comment change, maybe it's more prominent indeed. > > If the issue is fixed, there would be no crash. If crash is not appropriate there, I am fine with "changing the variable clears snapshot". Regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: BUG #17804: Assertion failed in pg_stat after fetching from pg_stat_database and switching cache->snapshot
From
Kyotaro Horiguchi
Date:
At Fri, 28 Apr 2023 15:04:04 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in > At Fri, 28 Apr 2023 13:37:23 +0900, Michael Paquier <michael@paquier.xyz> wrote in > > FWIW, I was looking at this patch and this proposal of relying on a > > static flag to conditionally clear a snapshot looks rather brittle by > > design to me because this relies on quite a few assumptions that the > > snapshot will always be cleared when necessary, as proved by the two > > code paths of pgstat.c patched where each gain a check on > > pgstat_get_stat_snapshot_timestmp() necessarily requires that. Just Sorry, this is actually the opposite. pgstat_get_stat_snapshot_timestmp() doesn't necessarily requires that. Just > for seeming consistency. The only significant part is > gstat_prep_snapshot, which is always called when the caller wants a > snapshot. Regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: BUG #17804: Assertion failed in pg_stat after fetching from pg_stat_database and switching cache->snapshot
From
Michael Paquier
Date:
On Fri, Apr 28, 2023 at 03:04:04PM +0900, Kyotaro Horiguchi wrote: > Just I wanted not to do that much in the guc callback including memory > context operations. If it is compeltly safe, I don't object just > clearing snapshots in the callback. I vaguely recalled some memory context deletions done in one of the assign callbacks, like the one for the plan resets, but it doesn't seem to be the case, looking closely.. Hm. -- Michael
Attachment
Re: BUG #17804: Assertion failed in pg_stat after fetching from pg_stat_database and switching cache->snapshot
From
Kyotaro Horiguchi
Date:
At Fri, 28 Apr 2023 15:48:16 +0900, Michael Paquier <michael@paquier.xyz> wrote in > On Fri, Apr 28, 2023 at 03:04:04PM +0900, Kyotaro Horiguchi wrote: > > Just I wanted not to do that much in the guc callback including memory > > context operations. If it is compeltly safe, I don't object just > > clearing snapshots in the callback. > > I vaguely recalled some memory context deletions done in one of the > assign callbacks, like the one for the plan resets, but it doesn't > seem to be the case, looking closely.. Hm. For the the record, I'm not saying that it is dangerous to clear snapshots directly in the callback. In fact, as I mentioned earlier, I believe there is no issue with that. But, I believe it is simpler that the actual work is separate from the callback path since we don't need to worry about when the guc-callback will be called. Regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: BUG #17804: Assertion failed in pg_stat after fetching from pg_stat_database and switching cache->snapshot
From
Michael Paquier
Date:
On Fri, Apr 28, 2023 at 04:07:52PM +0900, Kyotaro Horiguchi wrote: > For the the record, I'm not saying that it is dangerous to clear > snapshots directly in the callback. In fact, as I mentioned earlier, I > believe there is no issue with that. But, I believe it is simpler > that the actual work is separate from the callback path since we don't > need to worry about when the guc-callback will be called. pgstat_clear_backend_activity_snapshot() and pgstat_assert_is_up() are the two points of contention here. The former could be called in a non-backend context, which would be an incorrect use of this API. I am not completely sure that pgstat_assert_is_up() would be always correct, as well, though I have not been able to see a problem. Anyway, sorry for the late replay, it took me some time to study this subsystem and understand what's going on here (like why we have fixed snapshots but these don't list a timestamp, for example). At the end, I think that I'm OK with your suggestion to just force a snapshot cleanup each time the GUC is changed, mainly because that's simple to understand, and do the clear based on the timing where the next snapshot build would happen. I would be curious to hear arguments if there would be a need for a more complex mechanism, but that seems overkill to me as this is mainly here when switching out of the "snapshot" mode. A commit/abort would force a cleanup of the snapshot, clearing the flag, as well. I would add a small note in the docs about this behavior. Another thing is to add a few pg_stat_get_snapshot_timestamp() after building a snapshot for each mode tested. While on it, I have noticed that postgresql.conf.sample does not list the values available for stats_fetch_consistency. These had better be added to the sample file, no? At the end, I am finishing with the attached, without the sample file part. Thoughts? -- Michael
Attachment
Re: BUG #17804: Assertion failed in pg_stat after fetching from pg_stat_database and switching cache->snapshot
From
Michael Paquier
Date:
On Mon, May 01, 2023 at 02:18:00PM +0900, Michael Paquier wrote: > At the end, I am finishing with the attached, without the sample file > part. Thoughts? FYI, I am planning to get this stuff done next week after the next set of minor releases gets tagged (time was a bit scarce here this week to get that properly done). If there are no objections, of course. -- Michael
Attachment
Re: BUG #17804: Assertion failed in pg_stat after fetching from pg_stat_database and switching cache->snapshot
From
Michael Paquier
Date:
On Sat, May 06, 2023 at 12:28:51PM +0900, Michael Paquier wrote: > FYI, I am planning to get this stuff done next week after the next set > of minor releases gets tagged (time was a bit scarce here this week to > get that properly done). If there are no objections, of course. Okay, so applied and backpatched down to 15 after a second review of the whole. -- Michael