Thread: Crash in new pgstats code
mylodon just showed a new-to-me failure mode [1]:

Core was generated by `postgres: cascade: startup recovering 000000010000000000000002 '.
Program terminated with signal SIGABRT, Aborted.
#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:49
49      ../sysdeps/unix/sysv/linux/raise.c: No such file or directory.
#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:49
#1  0x00007f8b8db2d546 in __GI_abort () at abort.c:79
#2  0x000000000098a4dc in ExceptionalCondition (conditionName=<optimized out>, errorType=0x9e8061 "FailedAssertion",
    fileName=0xaf811f "/mnt/resource/bf/build/mylodon/HEAD/pgsql.build/../pgsql/src/backend/lib/dshash.c",
    lineNumber=lineNumber@entry=744) at /mnt/resource/bf/build/mylodon/HEAD/pgsql.build/../pgsql/src/backend/utils/error/assert.c:69
#3  0x00000000006dbe65 in dshash_delete_current (status=status@entry=0x7fffec732dc8)
    at /mnt/resource/bf/build/mylodon/HEAD/pgsql.build/../pgsql/src/backend/lib/dshash.c:744
#4  0x000000000085f911 in pgstat_free_entry (shent=0x7f8b8b0fc320, hstat=0x7fffec732dc8)
    at /mnt/resource/bf/build/mylodon/HEAD/pgsql.build/../pgsql/src/backend/utils/activity/pgstat_shmem.c:741
#5  pgstat_drop_entry_internal (shent=0x7f8b8b0fc320, hstat=hstat@entry=0x7fffec732dc8)
    at /mnt/resource/bf/build/mylodon/HEAD/pgsql.build/../pgsql/src/backend/utils/activity/pgstat_shmem.c:773
#6  0x000000000085fa2e in pgstat_drop_all_entries ()
    at /mnt/resource/bf/build/mylodon/HEAD/pgsql.build/../pgsql/src/backend/utils/activity/pgstat_shmem.c:887
#7  0x0000000000859301 in pgstat_reset_after_failure ()
    at /mnt/resource/bf/build/mylodon/HEAD/pgsql.build/../pgsql/src/backend/utils/activity/pgstat.c:1631
#8  pgstat_discard_stats ()
    at /mnt/resource/bf/build/mylodon/HEAD/pgsql.build/../pgsql/src/backend/utils/activity/pgstat.c:435
#9  0x0000000000555ae0 in StartupXLOG ()
    at /mnt/resource/bf/build/mylodon/HEAD/pgsql.build/../pgsql/src/backend/access/transam/xlog.c:5127
#10 0x00000000007a8ece in StartupProcessMain ()
    at /mnt/resource/bf/build/mylodon/HEAD/pgsql.build/../pgsql/src/backend/postmaster/startup.c:267
#11 0x000000000079f44e in AuxiliaryProcessMain (auxtype=auxtype@entry=StartupProcess)
    at /mnt/resource/bf/build/mylodon/HEAD/pgsql.build/../pgsql/src/backend/postmaster/auxprocess.c:141
#12 0x00000000007a5891 in StartChildProcess (type=type@entry=StartupProcess)
    at /mnt/resource/bf/build/mylodon/HEAD/pgsql.build/../pgsql/src/backend/postmaster/postmaster.c:5417
#13 0x00000000007a3ea0 in PostmasterMain (argc=argc@entry=4, argv=<optimized out>, argv@entry=0x1d1cc90)
    at /mnt/resource/bf/build/mylodon/HEAD/pgsql.build/../pgsql/src/backend/postmaster/postmaster.c:1457
#14 0x00000000006f1b91 in main (argc=4, argv=0x1d1cc90)
    at /mnt/resource/bf/build/mylodon/HEAD/pgsql.build/../pgsql/src/backend/main/main.c:202
$1 = {si_signo = 6, si_errno = 0, si_code = -6, _sifields = {_pad = {675836, 1001, 0 <repeats 26 times>},
  _kill = {si_pid = 675836, si_uid = 1001},
  _timer = {si_tid = 675836, si_overrun = 1001, si_sigval = {sival_int = 0, sival_ptr = 0x0}},
  _rt = {si_pid = 675836, si_uid = 1001, si_sigval = {sival_int = 0, sival_ptr = 0x0}},
  _sigchld = {si_pid = 675836, si_uid = 1001, si_status = 0, si_utime = 0, si_stime = 0},
  _sigfault = {si_addr = 0x3e9000a4ffc, _addr_lsb = 0, _addr_bnd = {_lower = 0x0, _upper = 0x0}},
  _sigpoll = {si_band = 4299262939132, si_fd = 0}}}

Probably deserves investigation.

			regards, tom lane

[1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mylodon&dt=2022-04-15%2011%3A51%3A35
I wrote:
> mylodon just showed a new-to-me failure mode [1]:

Another occurrence here:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=kestrel&dt=2022-04-15%2022%3A42%3A07

I've added an open item.

			regards, tom lane
Hi,

On 2022-04-15 13:28:35 -0400, Tom Lane wrote:
> mylodon just showed a new-to-me failure mode [1]:

Thanks. Found the bug (pgstat_drop_all_entries() passed the wrong lock
level), with the obvious fix.

This failed to fail in other tests because they all end up resetting
only when there's no stats. It's not too hard to write a test for that,
which is how I reproduced the issue.

I'm planning to make it a bit easier to test by verifying that 'E' in
pgstat_read_statsfile() actually is just before EOF. That seems like a
good check anyway.

What confuses me so far is what already had generated stats before
reaching pgstat_reset_after_failure() (so that the bug could even be hit
in t/025_stuck_on_old_timeline.pl).

Greetings,

Andres Freund
Hi,

On 2022-04-16 12:13:09 -0700, Andres Freund wrote:
> What confuses me so far is what already had generated stats before
> reaching pgstat_reset_after_failure() (so that the bug could even be hit
> in t/025_stuck_on_old_timeline.pl).

I see part of a problem - in archiver stats. Even in 14 (and presumably
before), we do work that can generate archiver stats
(e.g. ReadCheckpointRecord()) before pgstat_reset_all(). It's not the
end of the world, but doesn't seem great. But since archiver stats are
fixed-numbered stats (and thus not in the hash table), they'd not
trigger the backtrace we saw here.

One thing that's interesting is that the failing tests have:

2022-04-15 12:07:48.828 UTC [675922][walreceiver][:0] FATAL:  could not link file "pg_wal/xlogtemp.675922" to "pg_wal/00000002.history": File exists

which I haven't seen locally. Looks like we have some race between
startup process and walreceiver? That seems not great. I'm a bit
confused that walreceiver and archiving are both active at the same time
in the first place - that doesn't seem right as things are set up
currently.

Greetings,

Andres Freund
Hi,

On 2022-04-16 12:13:09 -0700, Andres Freund wrote:
> On 2022-04-15 13:28:35 -0400, Tom Lane wrote:
> > mylodon just showed a new-to-me failure mode [1]:
>
> Thanks. Found the bug (pgstat_drop_all_entries() passed the wrong lock
> level), with the obvious fix.
>
> This failed to fail in other tests because they all end up resetting
> only when there's no stats. It's not too hard to write a test for that,
> which is how I reproduced the issue.
>
> I'm planning to make it a bit easier to test by verifying that 'E' in
> pgstat_read_statsfile() actually is just before EOF. That seems like a
> good check anyway.

I've pushed that fix.

> What confuses me so far is what already had generated stats before
> reaching pgstat_reset_after_failure() (so that the bug could even be hit
> in t/025_stuck_on_old_timeline.pl).

But there are still things I don't understand about that aspect.

Greetings,

Andres Freund
On Sat, Apr 16, 2022 at 02:36:33PM -0700, Andres Freund wrote:
> which I haven't seen locally. Looks like we have some race between
> startup process and walreceiver? That seems not great. I'm a bit
> confused that walreceiver and archiving are both active at the same time
> in the first place - that doesn't seem right as things are set up
> currently.

Yeah, that should be exclusively one or the other, never both.
WaitForWALToBecomeAvailable() would be a hot spot when it comes to
deciding when a WAL receiver should be spawned by the startup process.
Apart from the recent refactoring of xlog.c and the WAL prefetch work,
there have not been many changes in this area lately.
--
Michael
On Mon, Apr 18, 2022 at 7:19 PM Michael Paquier <michael@paquier.xyz> wrote:
> On Sat, Apr 16, 2022 at 02:36:33PM -0700, Andres Freund wrote:
> > which I haven't seen locally. Looks like we have some race between
> > startup process and walreceiver? That seems not great. I'm a bit
> > confused that walreceiver and archiving are both active at the same time
> > in the first place - that doesn't seem right as things are set up
> > currently.
>
> Yeah, that should be exclusively one or the other, never both.
> WaitForWALToBecomeAvailable() would be a hot spot when it comes to
> deciding when a WAL receiver should be spawned by the startup process.
> Apart from the recent refactoring of xlog.c and the WAL prefetch work,
> there have not been many changes in this area lately.

Hmm, well I'm not sure what is happening here and will try to dig
tomorrow, but one observation from some log scraping is that kestrel
logged similar output with "could not link file" several times before
the main prefetching commit (5dc0418).
I looked back 3 months on kestrel/HEAD and found these:

 commit  | log
---------+-------------------------------------------------------------------------------------------------------------------
 411b913 | https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=kestrel&dt=2022-03-27%2010:57:20&stg=recovery-check
 3d067c5 | https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=kestrel&dt=2022-03-29%2017:52:32&stg=recovery-check
 cd7ea75 | https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=kestrel&dt=2022-03-30%2015:25:03&stg=recovery-check
 8e053dc | https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=kestrel&dt=2022-03-30%2020:27:44&stg=recovery-check
 4e34747 | https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=kestrel&dt=2022-04-04%2020:32:24&stg=recovery-check
 01effb1 | https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=kestrel&dt=2022-04-06%2007:32:40&stg=recovery-check
 fbfe691 | https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=kestrel&dt=2022-04-07%2005:10:05&stg=recovery-check
 5dc0418 | https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=kestrel&dt=2022-04-07%2007:51:00&stg=recovery-check
 bd037dc | https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=kestrel&dt=2022-04-11%2022:00:58&stg=recovery-check
 a4b5754 | https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=kestrel&dt=2022-04-12%2004:40:44&stg=recovery-check
 7129a97 | https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=kestrel&dt=2022-04-15%2022:42:07&stg=recovery-check
 9f4f0a0 | https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=kestrel&dt=2022-04-16%2020:05:34&stg=recovery-check
Hi,

On 2022-04-18 22:45:07 +1200, Thomas Munro wrote:
> Hmm, well I'm not sure what is happening here and will try to dig
> tomorrow, but one observation from some log scraping is that kestrel
> logged similar output with "could not link file" several times before
> the main prefetching commit (5dc0418). I looked back 3 months on
> kestrel/HEAD and found these:

Kestrel won't go that far back even - I set it up 23 days ago...

I'm formally on vacation till Thursday, I'll try to look at earlier
instances then. Unless it's already figured out :).

I failed at reproducing it locally, despite a fair bit of effort.

The BF really should break out individual tests into their own stage
logs. The recovery-check stage is 13MB and 150k lines by now.

Greetings,

Andres Freund
On Tue, Apr 19, 2022 at 2:50 AM Andres Freund <andres@anarazel.de> wrote:
> Kestrel won't go that far back even - I set it up 23 days ago...

Here's a ~6 month old example from mylodon (I can't see much further
back than that with HTTP requests... I guess BF records are purged?):

https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=mylodon&dt=2021-10-19%2022%3A57%3A54&stg=recovery-check
On Tue, Apr 19, 2022 at 08:31:05PM +1200, Thomas Munro wrote:
> On Tue, Apr 19, 2022 at 2:50 AM Andres Freund <andres@anarazel.de> wrote:
> > Kestrel won't go that far back even - I set it up 23 days ago...
>
> Here's a ~6 month old example from mylodon (I can't see much further
> back than that with HTTP requests... I guess BF records are purged?):
>
> https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=mylodon&dt=2021-10-19%2022%3A57%3A54&stg=recovery-check

Do we have anything remaining on this thread in light of the upcoming
beta1? One fix has been pushed upthread, but it does not seem we are
completely in the clear either.
--
Michael
Hi,

On 2022-05-11 15:46:13 +0900, Michael Paquier wrote:
> On Tue, Apr 19, 2022 at 08:31:05PM +1200, Thomas Munro wrote:
> > On Tue, Apr 19, 2022 at 2:50 AM Andres Freund <andres@anarazel.de> wrote:
> > > Kestrel won't go that far back even - I set it up 23 days ago...
> >
> > Here's a ~6 month old example from mylodon (I can't see much further
> > back than that with HTTP requests... I guess BF records are purged?):
> >
> > https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=mylodon&dt=2021-10-19%2022%3A57%3A54&stg=recovery-check
>
> Do we have anything remaining on this thread in light of the upcoming
> beta1? One fix has been pushed upthread, but it does not seem we are
> completely in the clear either.

I don't know what else there is to do, tbh.

Greetings,

Andres Freund
Andres Freund <andres@anarazel.de> writes:
> On 2022-05-11 15:46:13 +0900, Michael Paquier wrote:
>> Do we have anything remaining on this thread in light of the upcoming
>> beta1? One fix has been pushed upthread, but it does not seem we are
>> completely in the clear either.

> I don't know what else there is to do, tbh.

Well, it was mostly you expressing misgivings upthread ;-). But we have
not seen any pgstat crashes lately, so I'm content to mark the open item
as resolved.

			regards, tom lane
Hi,

On 2022-05-12 12:12:59 -0400, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > I don't know what else there is to do, tbh.
>
> Well, it was mostly you expressing misgivings upthread ;-).

Those mostly were about stuff in 14 as well... I guess it'd be good to
figure out precisely how the problem was triggered, but without further
information I don't quite see how to figure it out...

> But we have not seen any pgstat crashes lately, so I'm content to mark
> the open item as resolved.

Cool.

Greetings,

Andres Freund
On Thu, May 12, 2022 at 09:33:05AM -0700, Andres Freund wrote:
> On 2022-05-12 12:12:59 -0400, Tom Lane wrote:
>> But we have not seen any pgstat crashes lately, so I'm content to mark
>> the open item as resolved.
>
> Cool.

Okay, thanks for the feedback. I have marked the item as resolved for
the time being. Let's revisit it later if necessary.
--
Michael