Re: Unclear problem reports - Mailing list pgsql-hackers

From Andres Freund
Subject Re: Unclear problem reports
Date
Msg-id 20220205232045.czcyitieoybzmwdb@alap3.anarazel.de
Whole thread Raw
In response to Unclear problem reports  (Bruce Momjian <bruce@momjian.us>)
List pgsql-hackers
Hi,

On 2022-02-02 19:35:36 -0500, Bruce Momjian wrote:
> The Postgres community is great at diagnosing problems and giving users
> feedback.  In most cases, we can either diagnose a problem and give a
> fix, or at least give users a hint at finding the cause.
>
> However, there is a class of problems that are very hard to help with,
> and I have perhaps seen an increasing number of them recently, e.g.:
>
>     https://www.postgresql.org/message-id/17384-f50f2eedf541e512%40postgresql.org
>     https://www.postgresql.org/message-id/CALTnk7uOrMztNfzjNOZe3TdquAXDPD3vZKjWFWj%3D-Fv-gmROUQ%40mail.gmail.com
>
> I consider these as problems that need digging to find the cause, and
> users are usually unable to do sufficient digging, and we don't have
> time to give them instructions, so they never get a reply.
>
> Is there something we can do to improve this situation?

I think some of this can be improved measurably, with moderate effort, by
improving our error messages / diagnostics.


Taking e.g. the second report:

[7804] LOG:  starting PostgreSQL 13.1, compiled by Visual C++ build 1914, 64-bit
[8812] LOG:  invalid primary checkpoint record
[8812] PANIC:  could not locate a valid checkpoint record
[7804] LOG:  startup process (PID 8812) was terminated by exception 0xC0000409
[7804] HINT:  See C include file "ntstatus.h" for a description of the
hexadecimal value.
[7804] LOG:  aborting startup due to startup process failure
[7804] LOG:  database system is shut down

We don't provide any details in this log report that allow to diagnose the
problem:
"invalid primary checkpoint record" doesn't say *why* it's invalid, nor does
it say the location at which the checkpoint record was, nor whether the
cluster was shutdown gracefully before. It could be that the permissions on
the WAL files were wrong, it could be missing files, actually invalid WAL, ...

The while following bit about "startup process ... was terminated" and it's
HINT is useless information, because we actually "knowingly" caused that
PANIC.  Even disregarding that, it certainly doesn't help to list numerical
exception codes and references to ntstatus.h.


The first report is similar, although it's a bit harder to improve:
ERROR: missing chunk number 0 for toast value 152073604 in pg_toast_2619

Easy to fix: We don't mention the table that pg_toast_2619 corresponds to -
something we can determine, rather forcing the user to figure out how to do
so.

Harder to fix: We don't provide information about where the reference to the
toast value comes from. In general we don't necessarily know that, because
Datum's are handed around fairly widely. But in a lot of cases we actually
could know.


> Should we just tell them they need to hire a Postgres expert?  I assume
> these are users who do not already have access to such experts.

I think these are more our fault than our users :(

Greetings,

Andres Freund



pgsql-hackers by date:

Previous
From: Noah Misch
Date:
Subject: Re: Unclear problem reports
Next
From: Andres Freund
Date:
Subject: Re: [BUG]Update Toast data failure in logical replication