Thread: Misleading panic message in backend/access/transam/xlog.c

Misleading panic message in backend/access/transam/xlog.c

From
Michael Banck
Date:
Hi,

there was a report in #postgresql recently about a crash on Google Cloud
SQL with the somewhat misleading message "could not write to log file"
while in fact it was the xlog/wal:

|PANIC:  could not write to log file 000000010000019600000054 at offset
| 13279232, length 245760: Cannot allocate memory 
|ERROR:  could not write block 74666 in file "base/18031/48587": Cannot
| allocate memory 
|CONTEXT:  writing block 74666 of relation base/18031/48587 
|LOG:  server process (PID 5160) was terminated by signal 9: Killed 

The slightly longer logfile can be found here: http://dpaste.com/2T61PS9

I suggest to reword that message, e.g. "could not write to transaction
log file" or "could not write to wal file".

Also, the errno (ENOMEM) is curious (and the user wrote that Google
monitoring reported memory at 16/20GB at the time of the crash), but it
could be a due to running on a cloud-fork? As you have no access to
PGDATA, it sounds difficult to diagnose after the fact.


Michael

-- 
Michael Banck
Projektleiter / Senior Berater
Tel.: +49 2166 9901-171
Fax:  +49 2166 9901-100
Email: michael.banck@credativ.de

credativ GmbH, HRB Mönchengladbach 12080
USt-ID-Nummer: DE204566209
Trompeterallee 108, 41189 Mönchengladbach
Geschäftsführung: Dr. Michael Meskes, Jörg Folz, Sascha Heuer

Unser Umgang mit personenbezogenen Daten unterliegt
folgenden Bestimmungen: https://www.credativ.de/datenschutz


Re: Misleading panic message in backend/access/transam/xlog.c

From
Magnus Hagander
Date:

On Wed, Jan 9, 2019 at 12:06 PM Michael Banck <michael.banck@credativ.de> wrote:
Hi,

there was a report in #postgresql recently about a crash on Google Cloud
SQL with the somewhat misleading message "could not write to log file"
while in fact it was the xlog/wal:

|PANIC:  could not write to log file 000000010000019600000054 at offset
| 13279232, length 245760: Cannot allocate memory 
|ERROR:  could not write block 74666 in file "base/18031/48587": Cannot
| allocate memory 
|CONTEXT:  writing block 74666 of relation base/18031/48587
|LOG:  server process (PID 5160) was terminated by signal 9: Killed

The slightly longer logfile can be found here: http://dpaste.com/2T61PS9

I suggest to reword that message, e.g. "could not write to transaction
log file" or "could not write to wal file".

Given the change xlog -> wal, I would suggest "could not write to wal file" as the correct option there.

And +1 for rewording it. I think there are also some other messages like it that needs to be changed, and also things like

(errmsg("restored log file \"%s\" from archive"

could do with an update.


Also, the errno (ENOMEM) is curious (and the user wrote that Google
monitoring reported memory at 16/20GB at the time of the crash), but it
could be a due to running on a cloud-fork? As you have no access to
PGDATA, it sounds difficult to diagnose after the fact.

Yeah, nobody knows what Google has done in their fork *or* how they actually measure those things, so without a repro I think that's hard..


//Magnus

Re: Misleading panic message in backend/access/transam/xlog.c

From
Andres Freund
Date:
Hi,

On 2019-01-09 12:06:39 +0100, Michael Banck wrote:
> there was a report in #postgresql recently about a crash on Google Cloud
> SQL with the somewhat misleading message "could not write to log file"
> while in fact it was the xlog/wal:
> 
> |PANIC:  could not write to log file 000000010000019600000054 at offset
> | 13279232, length 245760: Cannot allocate memory 
> |ERROR:  could not write block 74666 in file "base/18031/48587": Cannot
> | allocate memory 
> |CONTEXT:  writing block 74666 of relation base/18031/48587 
> |LOG:  server process (PID 5160) was terminated by signal 9: Killed 
> 
> The slightly longer logfile can be found here: http://dpaste.com/2T61PS9
> 
> I suggest to reword that message, e.g. "could not write to transaction
> log file" or "could not write to wal file".

I'm quite unenthused about that. If anything, I'd remove detail and use
the standard error message about not being able to write to a file, and
include the full path.


> Also, the errno (ENOMEM) is curious (and the user wrote that Google
> monitoring reported memory at 16/20GB at the time of the crash), but it
> could be a due to running on a cloud-fork? As you have no access to
> PGDATA, it sounds difficult to diagnose after the fact.

Yes, that sounds quite likely. This pretty much is a write() which isn't
documented to return ENOMEM commonly, so I assume Google's doing
something odd.

Greetings,

Andres Freund


Re: Misleading panic message in backend/access/transam/xlog.c

From
Michael Paquier
Date:
On Wed, Jan 09, 2019 at 08:10:43AM -0800, Andres Freund wrote:
> I'm quite unenthused about that. If anything, I'd remove detail and use
> the standard error message about not being able to write to a file, and
> include the full path.

Partially agreed.  Those messages have been left out of 56df07b
because they include some context about the offset and the length, and
I don't think that we simply want to remove that information.  What
about making the offset and the length part of an extra errdetail, and
switch the main error string to a more generic one?
--
Michael

Attachment

Re: Misleading panic message in backend/access/transam/xlog.c

From
Andres Freund
Date:

On January 9, 2019 5:01:40 PM PST, Michael Paquier <michael@paquier.xyz> wrote:
>On Wed, Jan 09, 2019 at 08:10:43AM -0800, Andres Freund wrote:
>> I'm quite unenthused about that. If anything, I'd remove detail and
>use
>> the standard error message about not being able to write to a file,
>and
>> include the full path.
>
>Partially agreed.  Those messages have been left out of 56df07b
>because they include some context about the offset and the length, and
>I don't think that we simply want to remove that information.  What
>about making the offset and the length part of an extra errdetail, and
>switch the main error string to a more generic one?

IIRC we have other such errors including offset and length (and if not we'll grow some). It should be formatted as a
geneticwrite error with the file name, no reference to log file, etc, even if there's no precedent. 

Andres
--
Sent from my Android device with K-9 Mail. Please excuse my brevity.


Re: Misleading panic message in backend/access/transam/xlog.c

From
Michael Paquier
Date:
On Wed, Jan 09, 2019 at 05:09:19PM -0800, Andres Freund wrote:
> IIRC we have other such errors including offset and length (and if
> not we'll grow some). It should be formatted as a genetic write
> error with the file name, no reference to log file, etc, even if
> there's no precedent.

Yeah, there are a couple of them:
access/transam/xlog.c:
errmsg("could not read from log segment %s, offset %u: %m",
access/transam/xlog.c:
errmsg("could not read from log segment %s, offset %u: read %d of %zu",
access/transam/xlogutils.c:
errmsg("could not seek in log segment %s to offset %u: %m"
access/transam/xlogutils.c:
errmsg("could not read from log segment %s, offset %u, length %lu: %m",
replication/walreceiver.c:
errmsg("could not seek in log segment %s to offset %u: %m",
replication/walsender.c:
errmsg("could not seek in log segment %s to offset %u: %m",
replication/walsender.c:
errmsg("could not read from log segment %s, offset %u, length %zu: %m",
replication/walsender.c:
errmsg("could not read from log segment %s, offset %u: read %d of %zu",
--
Michael

Attachment

Re: Misleading panic message in backend/access/transam/xlog.c

From
Robert Haas
Date:
On Wed, Jan 9, 2019 at 8:38 PM Michael Paquier <michael@paquier.xyz> wrote:
> On Wed, Jan 09, 2019 at 05:09:19PM -0800, Andres Freund wrote:
> > IIRC we have other such errors including offset and length (and if
> > not we'll grow some). It should be formatted as a genetic write
> > error with the file name, no reference to log file, etc, even if
> > there's no precedent.
>
> Yeah, there are a couple of them:
> access/transam/xlog.c:
> errmsg("could not read from log segment %s, offset %u: %m",

In smgr.c, we have:

"could not read block %u in file \"%s\": %m"

That seems to be the closet thing we have to a generic message
template right now, but it's not entirely generic because it talks
about blocks.  Maybe we should go with something like:

"could not read %u bytes in file \"%s\" at offset %u: %m"

...and use that for both WAL and smgr.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company