Re: backtrace_on_internal_error - Mailing list pgsql-hackers

From Andres Freund
Subject Re: backtrace_on_internal_error
Msg-id 20231208181451.deqnflwxqoehhxpe@awork3.anarazel.de
In response to Re: backtrace_on_internal_error  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-hackers
Hi,

On 2023-12-08 10:05:09 -0500, Tom Lane wrote:
> Peter Eisentraut <peter@eisentraut.org> writes:
> > One possible question for discussion is whether the default for this
> > should be off, on, or possibly something like on-in-assert-builds.
> > (Personally, I'm happy to turn it on myself at run time, but everyone
> > has different workflows.)
>
> ... there was already opinion upthread that this should be on by
> default, which I agree with.  You shouldn't be hitting cases like
> this commonly (if so, they're bugs to fix or the errcode should be
> rethought), and the failure might be pretty hard to reproduce.

FWIW, I did some analysis on aggregated logs from a large number of machines,
and it does look like enabling this by default would measurably increase log
volume. There are a few voluminous internal errors in core, but the bigger
issue is extensions. They are typically much less disciplined about assigning
error codes than core PG is.

I've been wondering about doing some macro hackery to inform elog.c about
whether a log message is from core or an extension. It might even be possible
to identify the concrete extension, e.g. by updating the contents of
PG_MODULE_MAGIC during module loading, and referencing that.
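
To make the idea concrete, here is a minimal, standalone sketch of the compile-time variant: a macro captures an origin string at each call site, and the logging routine uses it to tell core from extension. Everything in it (LOG_ORIGIN_NAME, format_log, LOG_MESSAGE) is hypothetical and not existing PostgreSQL API. It also does not attempt the PG_MODULE_MAGIC approach mentioned above; it only assumes that an extension's build could define an origin name while core leaves it unset.

```c
#include <stdio.h>

/*
 * Hypothetical convention: an extension's build defines LOG_ORIGIN_NAME
 * (for example with -DLOG_ORIGIN_NAME='"my_ext"'). Core code leaves it
 * undefined, so it falls back to NULL here.
 */
#ifndef LOG_ORIGIN_NAME
#define LOG_ORIGIN_NAME NULL
#endif

/*
 * Stand-in for the logging backend: prefix the message with its origin.
 * A NULL origin is treated as "core".
 */
static int
format_log(char *buf, size_t buflen, const char *origin, const char *msg)
{
    if (origin == NULL)
        return snprintf(buf, buflen, "[core] %s", msg);
    return snprintf(buf, buflen, "[extension %s] %s", origin, msg);
}

/*
 * Call-site macro: passes along whatever LOG_ORIGIN_NAME expands to in the
 * translation unit being compiled, so callers need no extra arguments.
 */
#define LOG_MESSAGE(buf, buflen, msg) \
    format_log((buf), (buflen), LOG_ORIGIN_NAME, (msg))
```

With something along these lines, a policy such as "emit backtraces only for internal errors raised by core" could key off the origin, though how that would fit into elog.c is left open here.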


Based on the aforementioned data, the most common in-core log messages
without assigned error codes are:

could not accept SSL connection: %m - with zero errno
archive command was terminated by signal %d: %s
could not send data to client: %m - with zero errno
cache lookup failed for type %u
archive command failed with exit code %d
tuple concurrently updated
could not restore file "%s" from archive: %s
archive command was terminated by signal %d: %s
%s at file "%s" line %u
invalid memory alloc request size %zu
could not send data to client: %m
could not open directory "%s": %m - errno indicating ENOMEM
could not write init file
out of relcache_callback_list slots
online backup was canceled, recovery cannot continue
requested timeline %u does not contain minimum recovery point %X/%X on timeline %u


There were a lot more in older PG versions; I tried to filter those out.


I'm a bit confused about the huge number of "could not accept SSL connection:
%m" with a zero errno. I guess we must be clearing errno somehow, but I don't
immediately see where. Or perhaps we need to actually look at what
SSL_get_error() returns?
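
As an illustration of the zero-errno problem, here is a small standalone sketch of one possible guard: only format the errno text when the call actually failed and errno is set, and otherwise report something more meaningful. The function name describe_accept_failure and the "EOF detected" wording are assumptions made for this example, not the actual code in PostgreSQL. It uses plain strerror() in place of %m and does not call into OpenSSL.

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: build the log message for a failed accept.
 * rc is the return value of the failing call, saved_errno is errno
 * captured immediately afterwards.
 */
static int
describe_accept_failure(char *buf, size_t buflen, int rc, int saved_errno)
{
    /* Only trust errno if the call reported failure and errno is nonzero. */
    if (rc < 0 && saved_errno != 0)
        return snprintf(buf, buflen,
                        "could not accept SSL connection: %s",
                        strerror(saved_errno));

    /* Zero errno: formatting it would produce a misleading message. */
    return snprintf(buf, buflen,
                    "could not accept SSL connection: EOF detected");
}
```

A message produced this way could also be given a specific error code, which would keep it out of the "internal error" category under discussion.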

Greetings,

Andres Freund


