Thread: Crash in new pgstats code

Crash in new pgstats code

From
Tom Lane
Date:
mylodon just showed a new-to-me failure mode [1]:

Core was generated by `postgres: cascade: startup recovering 000000010000000000000002                '.
Program terminated with signal SIGABRT, Aborted.
#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:49
49    ../sysdeps/unix/sysv/linux/raise.c: No such file or directory.
#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:49
#1  0x00007f8b8db2d546 in __GI_abort () at abort.c:79
#2  0x000000000098a4dc in ExceptionalCondition (conditionName=<optimized out>, errorType=0x9e8061 "FailedAssertion",
fileName=0xaf811f "/mnt/resource/bf/build/mylodon/HEAD/pgsql.build/../pgsql/src/backend/lib/dshash.c",
lineNumber=lineNumber@entry=744) at
/mnt/resource/bf/build/mylodon/HEAD/pgsql.build/../pgsql/src/backend/utils/error/assert.c:69
#3  0x00000000006dbe65 in dshash_delete_current (status=status@entry=0x7fffec732dc8) at
/mnt/resource/bf/build/mylodon/HEAD/pgsql.build/../pgsql/src/backend/lib/dshash.c:744
#4  0x000000000085f911 in pgstat_free_entry (shent=0x7f8b8b0fc320, hstat=0x7fffec732dc8) at
/mnt/resource/bf/build/mylodon/HEAD/pgsql.build/../pgsql/src/backend/utils/activity/pgstat_shmem.c:741
#5  pgstat_drop_entry_internal (shent=0x7f8b8b0fc320, hstat=hstat@entry=0x7fffec732dc8) at
/mnt/resource/bf/build/mylodon/HEAD/pgsql.build/../pgsql/src/backend/utils/activity/pgstat_shmem.c:773
#6  0x000000000085fa2e in pgstat_drop_all_entries () at
/mnt/resource/bf/build/mylodon/HEAD/pgsql.build/../pgsql/src/backend/utils/activity/pgstat_shmem.c:887
#7  0x0000000000859301 in pgstat_reset_after_failure () at
/mnt/resource/bf/build/mylodon/HEAD/pgsql.build/../pgsql/src/backend/utils/activity/pgstat.c:1631
#8  pgstat_discard_stats () at
/mnt/resource/bf/build/mylodon/HEAD/pgsql.build/../pgsql/src/backend/utils/activity/pgstat.c:435
#9  0x0000000000555ae0 in StartupXLOG () at
/mnt/resource/bf/build/mylodon/HEAD/pgsql.build/../pgsql/src/backend/access/transam/xlog.c:5127
#10 0x00000000007a8ece in StartupProcessMain () at
/mnt/resource/bf/build/mylodon/HEAD/pgsql.build/../pgsql/src/backend/postmaster/startup.c:267
#11 0x000000000079f44e in AuxiliaryProcessMain (auxtype=auxtype@entry=StartupProcess) at
/mnt/resource/bf/build/mylodon/HEAD/pgsql.build/../pgsql/src/backend/postmaster/auxprocess.c:141
#12 0x00000000007a5891 in StartChildProcess (type=type@entry=StartupProcess) at
/mnt/resource/bf/build/mylodon/HEAD/pgsql.build/../pgsql/src/backend/postmaster/postmaster.c:5417
#13 0x00000000007a3ea0 in PostmasterMain (argc=argc@entry=4, argv=<optimized out>, argv@entry=0x1d1cc90) at
/mnt/resource/bf/build/mylodon/HEAD/pgsql.build/../pgsql/src/backend/postmaster/postmaster.c:1457
#14 0x00000000006f1b91 in main (argc=4, argv=0x1d1cc90) at
/mnt/resource/bf/build/mylodon/HEAD/pgsql.build/../pgsql/src/backend/main/main.c:202
$1 = {si_signo = 6, si_errno = 0, si_code = -6, _sifields = {_pad = {675836, 1001, 0 <repeats 26 times>}, _kill =
{si_pid = 675836, si_uid = 1001}, _timer = {si_tid = 675836, si_overrun = 1001, si_sigval = {sival_int = 0, sival_ptr =
0x0}}, _rt = {si_pid = 675836, si_uid = 1001, si_sigval = {sival_int = 0, sival_ptr = 0x0}}, _sigchld = {si_pid =
675836, si_uid = 1001, si_status = 0, si_utime = 0, si_stime = 0}, _sigfault = {si_addr = 0x3e9000a4ffc, _addr_lsb = 0,
_addr_bnd = {_lower = 0x0, _upper = 0x0}}, _sigpoll = {si_band = 4299262939132, si_fd = 0}}}

Probably deserves investigation.

            regards, tom lane

[1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mylodon&dt=2022-04-15%2011%3A51%3A35



Re: Crash in new pgstats code

From
Tom Lane
Date:
I wrote:
> mylodon just showed a new-to-me failure mode [1]:

Another occurrence here:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=kestrel&dt=2022-04-15%2022%3A42%3A07

I've added an open item.

            regards, tom lane



Re: Crash in new pgstats code

From
Andres Freund
Date:
Hi,

On 2022-04-15 13:28:35 -0400, Tom Lane wrote:
> mylodon just showed a new-to-me failure mode [1]:

Thanks. Found the bug (pgstat_drop_all_entries() passed the wrong lock
level), with the obvious fix.
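
For anyone following along, the shape of the change is roughly this (a
simplified sketch, not necessarily the committed diff verbatim):
pgstat_drop_entry_internal() deletes entries via dshash_delete_current(),
which asserts that the current partition lock is held exclusively, so the
sequential scan has to be initialized with exclusive = true.

/*
 * Simplified sketch of pgstat_drop_all_entries(); assumes the backend's
 * dshash/pgstat headers and omits the rest of the function.
 */
void
pgstat_drop_all_entries(void)
{
    dshash_seq_status hstat;
    PgStatShared_HashEntry *ps;

    /*
     * The scan deletes entries, which requires the partition lock to be
     * held exclusively -- passing false here is what tripped the
     * assertion in dshash_delete_current() above.
     */
    dshash_seq_init(&hstat, pgStatLocal.shared_hash, true);
    while ((ps = dshash_seq_next(&hstat)) != NULL)
        pgstat_drop_entry_internal(ps, &hstat);
    dshash_seq_term(&hstat);
}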

This failed to fail in other tests because they all end up resetting
only when there's no stats. It's not too hard to write a test for that,
which is how I reproduced the issue.

I'm planning to make it a bit easier to test by verifying that 'E' in
pgstat_read_statsfile() actually is just before EOF. That seems like a
good check anyway.
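
Concretely, something along these lines in the 'E' branch of the read
loop (a sketch only; end_marker_is_at_eof() is a made-up helper name to
illustrate the check, not the actual patch):

#include <stdio.h>
#include <stdbool.h>

/*
 * Illustrative helper: after consuming the 'E' end-of-data marker, the
 * very next read must hit EOF.  Trailing bytes mean the stats file is
 * corrupt (or was written incorrectly), and pgstat_read_statsfile()
 * would take its error path and discard the stats.
 */
static bool
end_marker_is_at_eof(FILE *fpin)
{
    return fgetc(fpin) == EOF;
}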


What confuses me so far is what already had generated stats before
reaching pgstat_reset_after_failure() (so that the bug could even be hit
in t/025_stuck_on_old_timeline.pl).


Greetings,

Andres Freund



Re: Crash in new pgstats code

From
Andres Freund
Date:
Hi,

On 2022-04-16 12:13:09 -0700, Andres Freund wrote:
> What confuses me so far is what already had generated stats before
> reaching pgstat_reset_after_failure() (so that the bug could even be hit
> in t/025_stuck_on_old_timeline.pl).

I see part of a problem - in archiver stats. Even in 14 (and presumably
before), we do work that can generate archiver stats
(e.g. ReadCheckpointRecord()) before pgstat_reset_all().  It's not the
end of the world, but doesn't seem great.

But since archiver stats are fixed-numbered stats (and thus not in the
hash table), they'd not trigger the backtrace we saw here.
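
To make the distinction explicit (an illustrative sketch with made-up
struct names, not the actual PostgreSQL definitions): fixed-numbered
stats occupy static slots in the stats shared-memory segment and are
reset by zeroing them in place, whereas variable-numbered stats are
entries in the dshash table, and only the latter are dropped via
dshash_delete_current().

/* Illustrative only -- simplified, made-up names. */
typedef struct ArchiverStatsSketch
{
    long    archived_count;     /* reset by zeroing in place */
    long    failed_count;
} ArchiverStatsSketch;

typedef struct StatsShmemSketch
{
    /* fixed-numbered stats: one static slot each, never in the hash */
    ArchiverStatsSketch archiver;

    /*
     * Variable-numbered stats (per relation, function, ...) live here and
     * are dropped entry-by-entry, which is the code path that hit the
     * assertion in this thread.
     */
    dshash_table *shared_hash;
} StatsShmemSketch;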


One thing that's interesting is that the failing tests have:
2022-04-15 12:07:48.828 UTC [675922][walreceiver][:0] FATAL:  could not link file "pg_wal/xlogtemp.675922" to
"pg_wal/00000002.history": File exists

which I haven't seen locally. Looks like we have some race between
startup process and walreceiver? That seems not great.  I'm a bit
confused that walreceiver and archiving are both active at the same time
in the first place - that doesn't seem right as things are set up
currently.


Greetings,

Andres Freund



Re: Crash in new pgstats code

From
Andres Freund
Date:
Hi,

On 2022-04-16 12:13:09 -0700, Andres Freund wrote:
> On 2022-04-15 13:28:35 -0400, Tom Lane wrote:
> > mylodon just showed a new-to-me failure mode [1]:
> 
> Thanks. Found the bug (pgstat_drop_all_entries() passed the wrong lock
> level), with the obvious fix.
> 
> This failed to fail in other tests because they all end up resetting
> only when there's no stats. It's not too hard to write a test for that,
> which is how I reproduced the issue.
> 
> I'm planning to make it a bit easier to test by verifying that 'E' in
> pgstat_read_statsfile() actually is just before EOF. That seems like a
> good check anyway.

I've pushed that fix.


> What confuses me so far is what already had generated stats before
> reaching pgstat_reset_after_failure() (so that the bug could even be hit
> in t/025_stuck_on_old_timeline.pl).

But there's still things I don't understand about that aspect.

Greetings,

Andres Freund



Re: Crash in new pgstats code

From
Michael Paquier
Date:
On Sat, Apr 16, 2022 at 02:36:33PM -0700, Andres Freund wrote:
> which I haven't seen locally. Looks like we have some race between
> startup process and walreceiver? That seems not great.  I'm a bit
> confused that walreceiver and archiving are both active at the same time
> in the first place - that doesn't seem right as things are set up
> currently.

Yeah, that should be exclusively one or the other, never both.
WaitForWALToBecomeAvailable() would be a hot spot when it comes to
deciding when a WAL receiver should be spawned by the startup process.
Apart from the recent refactoring of xlog.c and the WAL prefetch work,
there have not been many changes in this area lately.
--
Michael


Re: Crash in new pgstats code

From
Thomas Munro
Date:
On Mon, Apr 18, 2022 at 7:19 PM Michael Paquier <michael@paquier.xyz> wrote:
> On Sat, Apr 16, 2022 at 02:36:33PM -0700, Andres Freund wrote:
> > which I haven't seen locally. Looks like we have some race between
> > startup process and walreceiver? That seems not great.  I'm a bit
> > confused that walreceiver and archiving are both active at the same time
> > in the first place - that doesn't seem right as things are set up
> > currently.
>
> Yeah, that should be exclusively one or the other, never both.
> WaitForWALToBecomeAvailable() would be a hot spot when it comes to
> deciding when a WAL receiver should be spawned by the startup process.
> Apart from the recent refactoring of xlog.c and the WAL prefetch work,
> there have not been many changes in this area lately.

Hmm, well I'm not sure what is happening here and will try to dig
tomorrow, but one observation from some log scraping is that kestrel
logged similar output with "could not link file" several times before
the main prefetching commit (5dc0418).  I looked back 3 months on
kestrel/HEAD and found these:

 commit  | log
---------+-------------------------------------------------------------------------------------------------------------------
 411b913 | https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=kestrel&dt=2022-03-27%2010:57:20&stg=recovery-check
 3d067c5 | https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=kestrel&dt=2022-03-29%2017:52:32&stg=recovery-check
 cd7ea75 | https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=kestrel&dt=2022-03-30%2015:25:03&stg=recovery-check
 8e053dc | https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=kestrel&dt=2022-03-30%2020:27:44&stg=recovery-check
 4e34747 | https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=kestrel&dt=2022-04-04%2020:32:24&stg=recovery-check
 01effb1 | https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=kestrel&dt=2022-04-06%2007:32:40&stg=recovery-check
 fbfe691 | https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=kestrel&dt=2022-04-07%2005:10:05&stg=recovery-check
 5dc0418 | https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=kestrel&dt=2022-04-07%2007:51:00&stg=recovery-check
 bd037dc | https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=kestrel&dt=2022-04-11%2022:00:58&stg=recovery-check
 a4b5754 | https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=kestrel&dt=2022-04-12%2004:40:44&stg=recovery-check
 7129a97 | https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=kestrel&dt=2022-04-15%2022:42:07&stg=recovery-check
 9f4f0a0 | https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=kestrel&dt=2022-04-16%2020:05:34&stg=recovery-check



Re: Crash in new pgstats code

From
Andres Freund
Date:
Hi,

On 2022-04-18 22:45:07 +1200, Thomas Munro wrote:
> On Mon, Apr 18, 2022 at 7:19 PM Michael Paquier <michael@paquier.xyz> wrote:
> > On Sat, Apr 16, 2022 at 02:36:33PM -0700, Andres Freund wrote:
> > > which I haven't seen locally. Looks like we have some race between
> > > startup process and walreceiver? That seems not great.  I'm a bit
> > > confused that walreceiver and archiving are both active at the same time
> > > in the first place - that doesn't seem right as things are set up
> > > currently.
> >
> > Yeah, that should be exclusively one or the other, never both.
> > WaitForWALToBecomeAvailable() would be a hot spot when it comes to
> > deciding when a WAL receiver should be spawned by the startup process.
> > Apart from the recent refactoring of xlog.c and the WAL prefetch work,
> > there have not been many changes in this area lately.
> 
> Hmm, well I'm not sure what is happening here and will try to dig
> tomorrow, but one observation from some log scraping is that kestrel
> logged similar output with "could not link file" several times before
> the main prefetching commit (5dc0418).  I looked back 3 months on
> kestrel/HEAD and found these:

Kestrel won't go that far back even - I set it up 23 days ago...

I'm formally on vacation till Thursday, I'll try to look at earlier
instances then. Unless it's already figured out :). I failed at
reproducing it locally, despite a fair bit of effort.

The BF really should break out individual tests into their own stage
logs. The recovery-check stage is 13MB and 150k lines by now.

Greetings,

Andres Freund



Re: Crash in new pgstats code

From
Thomas Munro
Date:
On Tue, Apr 19, 2022 at 2:50 AM Andres Freund <andres@anarazel.de> wrote:
> Kestrel won't go that far back even - I set it up 23 days ago...

Here's a ~6 month old example from mylodon (I can't see much further
back than that with HTTP requests... I guess BF records are purged?):

https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=mylodon&dt=2021-10-19%2022%3A57%3A54&stg=recovery-check



Re: Crash in new pgstats code

From
Michael Paquier
Date:
On Tue, Apr 19, 2022 at 08:31:05PM +1200, Thomas Munro wrote:
> On Tue, Apr 19, 2022 at 2:50 AM Andres Freund <andres@anarazel.de> wrote:
> > Kestrel won't go that far back even - I set it up 23 days ago...
>
> Here's a ~6 month old example from mylodon (I can't see much further
> back than that with HTTP requests... I guess BF records are purged?):
>
> https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=mylodon&dt=2021-10-19%2022%3A57%3A54&stg=recovery-check

Do we have anything remaining on this thread in light of the upcoming
beta1?  One fix has been pushed upthread, but it does not seem we are
completely in the clear either.
--
Michael


Re: Crash in new pgstats code

From
Andres Freund
Date:
Hi,

On 2022-05-11 15:46:13 +0900, Michael Paquier wrote:
> On Tue, Apr 19, 2022 at 08:31:05PM +1200, Thomas Munro wrote:
> > On Tue, Apr 19, 2022 at 2:50 AM Andres Freund <andres@anarazel.de> wrote:
> > > Kestrel won't go that far back even - I set it up 23 days ago...
> > 
> > Here's a ~6 month old example from mylodon (I can't see much further
> > back than that with HTTP requests... I guess BF records are purged?):
> > 
> > https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=mylodon&dt=2021-10-19%2022%3A57%3A54&stg=recovery-check
> 
> Do we have anything remaining on this thread in light of the upcoming
> beta1?  One fix has been pushed upthread, but it does not seem we are
> completely in the clear either.

I don't know what else there is to do, tbh.

Greetings,

Andres Freund



Re: Crash in new pgstats code

From
Tom Lane
Date:
Andres Freund <andres@anarazel.de> writes:
> On 2022-05-11 15:46:13 +0900, Michael Paquier wrote:
>> Do we have anything remaining on this thread in light of the upcoming
>> beta1?  One fix has been pushed upthread, but it does not seem we are
>> completely in the clear either.

> I don't know what else there is to do, tbh.

Well, it was mostly you expressing misgivings upthread ;-).  But we
have not seen any pgstat crashes lately, so I'm content to mark the
open item as resolved.

            regards, tom lane



Re: Crash in new pgstats code

From
Andres Freund
Date:
Hi,

On 2022-05-12 12:12:59 -0400, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > On 2022-05-11 15:46:13 +0900, Michael Paquier wrote:
> >> Do we have anything remaining on this thread in light of the upcoming
> >> beta1?  One fix has been pushed upthread, but it does not seem we are
> >> completely in the clear either.
> 
> > I don't know what else there is to do, tbh.
> 
> Well, it was mostly you expressing misgivings upthread ;-).

Those mostly were about stuff in 14 as well... I guess it'd be good to figure
out precisely how the problem was triggered, but without further information
I don't quite see how to figure it out...


> But we have not seen any pgstat crashes lately, so I'm content to mark the
> open item as resolved.

Cool.

Greetings,

Andres Freund



Re: Crash in new pgstats code

From
Michael Paquier
Date:
On Thu, May 12, 2022 at 09:33:05AM -0700, Andres Freund wrote:
> On 2022-05-12 12:12:59 -0400, Tom Lane wrote:
>> But we have not seen any pgstat crashes lately, so I'm content to mark the
>> open item as resolved.
>
> Cool.

Okay, thanks for the feedback.  I have marked the item as resolved for
the time being.  Let's revisit it later if necessary.
--
Michael
