Hi,
On 2021-10-06 14:11:51 +0800, Craig Ringer wrote:
> On Wed, 6 Oct 2021, 03:30 Andres Freund, <andres@anarazel.de> wrote:
>
> > Hi,
> >
> >
> > My first attempt was to try to use the existing crashdump stuff in
> > pgwin32_install_crashdump_handler(). That's not really quite what I want,
> > because it only handles postmaster rather than any binary, but I thought
> > it'd
> > be a good start. But outside of toy situations it didn't work for me.
> >
>
> Odd. It usually has for me, and definitely not limited to the postmaster.
> But it will fall down for OOM, smashed stack, and other states where
> in-process self-debugging is likely to fail.
I think it's a question of running debug vs optimized builds. At least in
debug builds it doesn't appear to work because the debug c runtime abort
preempts it.
> I patch this out when working on windows because it's a real pain.
>
> I keep meaning to propose that we remove this functionality entirely. It's
> obsolete. It was introduced back in the days where DrWatson.exe "windows
> error reporting") used to launch an interactive prompt asking the user what
> to do when a process crashed. This would block the crashed process from
> exiting, making everything grind to a halt until the user interacted with
> the
> UI. Even for a service process.
> Not fun on a headless or remote server.
Yea, the way we do it right now is definitely not helpful. Especially because
it doesn't actually prevent the "hang" issue - the CRT boxes at least cause
precisely such stalls.
We've had a few CI hangs due to such errors.
> These days Windows handles all this a lot more sensibly, and blocking crash
> reporting is quite obsolete and unhelpful.
From what I've seen it didn't actually get particularly sensible, just
different and more complicated.
From what I've seen one needs at least:
- _set_abort_behavior(_CALL_REPORTFAULT | _WRITE_ABORT_MSG)
- _set_error_mode(_OUT_TO_STDERR)
- _CrtSetReportMode(_CRT_ASSERT/ERROR, _CRTDBG_MODE_FILE | _CRTDBG_MODE_DEBUG)
- SetErrorMode(SEM_FAILCRITICALERRORS)
There's many things this hocuspocus can be called, but sensible isn't among my
word choices for it.
> I'd like to just remove it.
I think we need to remove the SEM_NOGPFAULTERRORBOX, but I don't think we can
remove the SEM_FAILCRITICALERRORS, and I think we need the rest of the above
to prevent error boxes from happening.
I think we ought to actually apply these to all binaries, not just
postgres. One CI hung was due to psql asserting. But there's currently no easy
hook point for all binaries afaict. If we were to introduce something it
should probably be in pgport? But I'd defer to that to a later step.
I've attached a patch implementing these changes.
I also made one run with that intentionally fail (with an Assert(false)), and
with the changes the debugger is invoked and creates a backtrace etc:
https://cirrus-ci.com/task/5447300246929408
(click on crashlog-> at the top)
Greetings,
Andres Freund