Thread: Better HINT message for "unexpected data beyond EOF"

Better HINT message for "unexpected data beyond EOF"

From
Jakub Wartak
Date:
I would like to propose that we tweak the following error message:

ERROR:  unexpected data beyond EOF in block 1472 of relation base/5/16387
HINT:  This has been seen to occur with buggy kernels; consider
updating your system.

to something more generic and less confusing. It comes from commit
ffae5cc5a602 (2006), and we are probably not running those "buggy"
kernels anywhere anymore. I've seen this error multiple times, and it
is usually due to some external influence overwriting/replacing the
files in PGDATA, after which some (potentially new) backends open()
those "new" files and find an unexpected file layout. In the real
world this usually means:
a. files being accidentally replaced/overwritten (please see the
attached file for a reproducer)
b. some obscure bugs (e.g. in EPAS, a PG fork, we have on-demand
automatic partition creation, and we had bugs/race conditions where
multiple backends ended up writing to the same relfilenode file)

so how about:
-HINT:  This has been seen to occur with buggy kernels; consider
updating your system.
+HINT:  This has been observed with files being overwritten, buggy
kernels and potentially other external file system influence.

?

-J.
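
The file-replacement scenario described above can be illustrated with a
minimal sketch. It is not the attached reproducer and it does not touch a
real PGDATA; the file name "16387", the ".restored" suffix and the helper
names are made up for illustration, and it assumes a POSIX file system.
It only shows that a descriptor opened before the replacement keeps seeing
the old file, while a fresh open() sees a file with a different identity
and a different number of blocks.

```python
import os
import tempfile

BLCKSZ = 8192  # PostgreSQL's default block size; used here only for arithmetic


def file_identity(path_or_fd):
    """Return (device, inode, number_of_blocks) for a path or an open fd."""
    if isinstance(path_or_fd, int):
        st = os.fstat(path_or_fd)
    else:
        st = os.stat(path_or_fd)
    return (st.st_dev, st.st_ino, st.st_size // BLCKSZ)


def demo_replaced_file():
    """Replace a file while an old descriptor is still open (hypothetical demo)."""
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "16387")  # stand-in for a relation file
        with open(path, "wb") as f:
            f.write(b"\x01" * BLCKSZ * 4)  # original file: 4 blocks

        old_fd = os.open(path, os.O_RDONLY)  # "old backend" keeps this open
        try:
            replacement = os.path.join(d, "16387.restored")
            with open(replacement, "wb") as f:
                f.write(b"\x02" * BLCKSZ * 2)  # replacement: only 2 blocks
            os.replace(replacement, path)  # external process swaps the file

            # The old descriptor still refers to the 4-block file; a new
            # open() of the same path would see the 2-block replacement.
            return file_identity(old_fd), file_identity(path)
        finally:
            os.close(old_fd)
```

Calling demo_replaced_file() returns two identities that disagree on both
the inode and the block count, which is the kind of disagreement between
backends that the original hint gives no clue about.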

Attachment

Re: Better HINT message for "unexpected data beyond EOF"

From
Robert Haas
Date:
On Wed, Mar 26, 2025 at 4:59 AM Jakub Wartak
<jakub.wartak@enterprisedb.com> wrote:
> ERROR:  unexpected data beyond EOF in block 1472 of relation base/5/16387
> HINT:  This has been seen to occur with buggy kernels; consider
> updating your system.
>
> to something more generic and less confusing. It comes from commit
> ffae5cc5a602 (2006), and we are probably not running those "buggy"
> kernels anywhere anymore. I've seen this error multiple times, and it
> is usually due to some external influence overwriting/replacing the
> files in PGDATA, after which some (potentially new) backends open()
> those "new" files and find an unexpected file layout. In the real
> world this usually means:
> a. files being accidentally replaced/overwritten (please see the
> attached file for a reproducer)
> b. some obscure bugs (e.g. in EPAS, a PG fork, we have on-demand
> automatic partition creation, and we had bugs/race conditions where
> multiple backends ended up writing to the same relfilenode file)
>
> so how about:
> -HINT:  This has been seen to occur with buggy kernels; consider
> updating your system.
> +HINT:  This has been observed with files being overwritten, buggy
> kernels and potentially other external file system influence.

I agree that we should emphasize the possibility of files being
overwritten. I'm not sure we should even mention buggy kernels -- is
there any evidence that's still a thing on still-running hardware? I
don't really like "other external file system influence" because that
sounds like vague weasel-wording.

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: Better HINT message for "unexpected data beyond EOF"

From
Jakub Wartak
Date:
On Wed, Mar 26, 2025 at 4:01 PM Robert Haas <robertmhaas@gmail.com> wrote:
[..]
> > so how about:
> > -HINT:  This has been seen to occur with buggy kernels; consider
> > updating your system.
> > +HINT:  This has been observed with files being overwritten, buggy
> > kernels and potentially other external file system influence.
>
> I agree that we should emphasize the possibility of files being
> overwritten.

> I'm not sure we should even mention buggy kernels -- is
> there any evidence that's still a thing on still-running hardware?

No, I do not have any, other than comments in source code from Tom.

> I don't really like "other external file system influence" because that
> sounds like vague weasel-wording.

That was somewhat intentional: I did not want to rule out any external
factor(s), so I stated it as vaguely as possible to stay generic. From
the perspective of the core server itself this is literally
"paranormal" / "rogue" activity (another entity opening and
overwriting data files), but I suppose bugs or, in some cases, fs
corruption could cause it too?

E.g. I've tracked down that Pavan fixed something in the 2ndQ
fast_redo/pg_xlog_prefetch extension in 2016, where a concurrency
bug in that extension was causing a similar problem on at
least one occasion: "...issue was caused because the prefetch worker
process reading back blocks that are being concurrently dropped by the
startup process (as a result of truncate operation). When the startup
process later tries to extend the relation, it finds an existing valid
block in the shared buffers and panics." (which sounds related to
data beyond EOF).
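
The failure mode in that report can be modelled with a toy sketch. This is
not PostgreSQL code: the list standing in for the relation file, the dict
standing in for shared buffers, and the names extend_relation and
page_is_new are all hypothetical. It assumes, as a simplification, that
the check fires when the block about to be added at EOF is already present
in the cache with non-empty contents.

```python
BLCKSZ = 8192  # PostgreSQL's default block size


class UnexpectedDataBeyondEOF(Exception):
    """Toy counterpart of the error discussed in this thread."""


def page_is_new(page: bytes) -> bool:
    # Simplification: treat an all-zero page as "new" (never initialized).
    return not any(page)


def extend_relation(file_blocks: list, buffer_cache: dict) -> int:
    """Append one zeroed block; fail if the cache already holds data there.

    file_blocks  -- list of pages, standing in for the file on disk
    buffer_cache -- {block_number: page}, standing in for shared buffers
    """
    new_block = len(file_blocks)  # first block number past the current EOF
    cached = buffer_cache.get(new_block)
    if cached is not None and not page_is_new(cached):
        # Something cached a valid page for a block the file does not have.
        raise UnexpectedDataBeyondEOF(
            f"unexpected data beyond EOF in block {new_block}"
        )
    page = bytes(BLCKSZ)
    file_blocks.append(page)
    buffer_cache[new_block] = page
    return new_block
```

In this model, a reader that cached block 2 just before the file was
truncated to two blocks leaves a stale entry behind, and the next
extend_relation() call raises, which mirrors the sequence in the quoted
report without claiming to reproduce the real buffer manager logic.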

Proposals:
1. HINT:  This has been observed with files being overwritten.
2. HINT:  This has been observed with files being overwritten, old
(2.6.x) buggy Linux kernels.
3. HINT:  This has been observed with files being overwritten, old
(2.6.x) buggy Linux kernels, corruption or other non-core PostgreSQL
bugs.
4. HINT:  This has been observed with files being overwritten, buggy
kernels and potentially other external file system influence.

TBH, anything that simply avoids blaming kernel folks directly would be
better, but as a non-native speaker I'm finding it a little hard to
articulate.

-J.



Re: Better HINT message for "unexpected data beyond EOF"

From
Andres Freund
Date:
Hi,

On 2025-03-27 10:25:50 +0100, Jakub Wartak wrote:
> On Wed, Mar 26, 2025 at 4:01 PM Robert Haas <robertmhaas@gmail.com> wrote:
> [..]
> > > so how about:
> > > -HINT:  This has been seen to occur with buggy kernels; consider
> > > updating your system.
> > > +HINT:  This has been observed with files being overwritten, buggy
> > > kernels and potentially other external file system influence.
> >
> > I agree that we should emphasize the possibility of files being
> > overwritten.
> 
> > I'm not sure we should even mention buggy kernels -- is
> > there any evidence that's still a thing on still-running hardware?
> 
> No, I do not have any, other than comments in source code from Tom.

FWIW, I'm not sure how much that was ever true. We certainly had our own bugs
that could lead to the error occurring.


> E.g. I've tracked down that Pavan fixed something in the 2ndQ
> fast_redo/pg_xlog_prefetch extension in 2016, where a concurrency
> bug in that extension was causing a similar problem on at
> least one occasion: "...issue was caused because the prefetch worker
> process reading back blocks that are being concurrently dropped by the
> startup process (as a result of truncate operation). When the startup
> process later tries to extend the relation, it finds an existing valid
> block in the shared buffers and panics." (which sounds related to
> data beyond EOF).

FWIW that's more generally broken than just this error. You can't just read in
data without holding a lock on a relation, that will cause breakage in all
kinds of ways.


> Proposals:
> 1. HINT:  This has been observed with files being overwritten.
> 2. HINT:  This has been observed with files being overwritten, old
> (2.6.x) buggy Linux kernels.
> 3. HINT:  This has been observed with files being overwritten, old
> (2.6.x) buggy Linux kernels, corruption or other non-core PostgreSQL
> bugs.
> 4. HINT:  This has been observed with files being overwritten, buggy
> kernels and potentially other external file system influence.

FWIW, I think we should just drop the HINT. We really have no clue what caused
it and a HINT should imo have at least some value other than "*Shrug*", which
is imo pretty much what these HINTs amount to, if they were a bit more blunt.

Greetings,

Andres Freund



Re: Better HINT message for "unexpected data beyond EOF"

From
Robert Haas
Date:
On Thu, Mar 27, 2025 at 10:12 AM Andres Freund <andres@anarazel.de> wrote:
> FWIW, I think we should just drop the HINT. We really have no clue what caused
> it and a HINT should imo have at least some value other than "*Shrug*", which
> is imo pretty much what these HINTs amount to, if they were a bit more blunt.

I think that would be better than what we have now, but I still wonder
if we should give some kind of a hint that an external process may be
doing something to that file. Jakub and I may be biased by having just
seen a case of exactly that in the field, but I wonder now how many
'data beyond EOF' messages are exactly that -- and it's not like the
user is going to guess that 'data beyond EOF' might mean that such a
thing occurred.

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: Better HINT message for "unexpected data beyond EOF"

From
Christoph Berg
Date:
Re: Robert Haas
> I think that would be better than what we have now, but I still wonder
> if we should give some kind of a hint that an external process may be
> doing something to that file. Jakub and I may be biased by having just
> seen a case of exactly that in the field, but I wonder now how many
> 'data beyond EOF' messages are exactly that -- and it's not like the
> user is going to guess that 'data beyond EOF' might mean that such a
> thing occurred.

HINT:  Did anything besides PostgreSQL touch that file?

Christoph



Re: Better HINT message for "unexpected data beyond EOF"

From
Jakub Wartak
Date:
On Thu, Mar 27, 2025 at 4:00 PM Christoph Berg <myon@debian.org> wrote:
>
> Re: Robert Haas
> > I think that would be better than what we have now, but I still wonder
> > if we should give some kind of a hint that an external process may be
> > doing something to that file. Jakub and I may be biased by having just
> > seen a case of exactly that in the field, but I wonder now how many
> > 'data beyond EOF' messages are exactly that -- and it's not like the
> > user is going to guess that 'data beyond EOF' might mean that such a
> > thing occurred.
>
> HINT:  Did anything besides PostgreSQL touch that file?

Thread bump. So we have the following candidates:

1. remove it as Andres stated:
ERROR:  unexpected data beyond EOF in block 1472 of relation base/5/16387

2a. Robert's idea
ERROR:  unexpected data beyond EOF in block 1472 of relation base/5/16387
HINT:  This has been observed with PostgreSQL files being overwritten.

2b. Christoph's idea
ERROR:  unexpected data beyond EOF in block 1472 of relation base/5/16387
HINT:  Did anything besides PostgreSQL touch that file?

Anything else? #1 has the advantage that we don't need to provide 11
translations inside src/backend/po/*.po (I could use Google Translate
when proposing the patch, but I do not take any responsibility for what
it generates ;))

Another question is should we back-patch this? I believe we should (?)

-J.



Re: Better HINT message for "unexpected data beyond EOF"

From
Robert Haas
Date:
On Tue, Apr 1, 2025 at 7:13 AM Jakub Wartak
<jakub.wartak@enterprisedb.com> wrote:
> Thread bump. So we have the following candidates:
>
> 1. remove it as Andres stated:
> ERROR:  unexpected data beyond EOF in block 1472 of relation base/5/16387
>
> 2a. Robert's idea
> ERROR:  unexpected data beyond EOF in block 1472 of relation base/5/16387
> HINT:  This has been observed with PostgreSQL files being overwritten.
>
> 2b. Christoph's idea
> ERROR:  unexpected data beyond EOF in block 1472 of relation base/5/16387
> HINT:  Did anything besides PostgreSQL touch that file?

I don't think I proposed that exact phrasing - I prefer (2b) over
(2a), although I would replace "besides" with "other than".

> Another question is should we back-patch this? I believe we should (?)

I don't think this qualifies as a bug. The current wording isn't
factually wrong, just unhelpful. Even if it were wrong, we need a
pretty good reason to change message strings in a stable branch,
because that can break things for users who are grepping for the
current string (or a translation thereof). If an overwhelming
consensus in favor of back-patching emerges, fine, but my gut feeling
is that back-patching will make more people sad than it makes happy.

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: Better HINT message for "unexpected data beyond EOF"

From
Christoph Berg
Date:
Re: Robert Haas
> > Another question is should we back-patch this? I believe we should (?)
> 
> I don't think this qualifies as a bug. The current wording isn't
> factually wrong, just unhelpful. Even if it were wrong, we need a
> pretty good reason to change message strings in a stable branch,
> because that can break things for users who are grepping for the
> current string (or a translation thereof). If an overwhelming
> consensus in favor of back-patching emerges, fine, but my gut feeling
> is that back-patching will make more people sad than it makes happy.

It's only the HINT part. If I were to grep/search for the message, I
would definitely use the message part.

Christoph
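
The point about searching on the message part rather than the HINT can be
sketched as follows. The regular expression and the function name
find_beyond_eof are assumptions made for illustration, based only on the
untranslated English ERROR line quoted in this thread; a real log may carry
prefixes or translated text that this pattern does not cover.

```python
import re

# Match only the primary message; the HINT line is deliberately ignored,
# so the match does not depend on whether (or how) the hint is worded.
PRIMARY = re.compile(
    r"ERROR:\s+unexpected data beyond EOF in block (\d+) of relation (\S+)"
)


def find_beyond_eof(log_text: str):
    """Return a list of (block_number, relation_path) found in log_text."""
    return [(int(m.group(1)), m.group(2)) for m in PRIMARY.finditer(log_text)]


# Sample input taken from the messages quoted in this thread.
with_old_hint = (
    "ERROR:  unexpected data beyond EOF in block 1472 of relation base/5/16387\n"
    "HINT:  This has been seen to occur with buggy kernels; consider\n"
    "updating your system.\n"
)
without_hint = (
    "ERROR:  unexpected data beyond EOF in block 1472 of relation base/5/16387\n"
)
```

Both sample strings yield the same result from find_beyond_eof(), whereas a
script keyed on the hint text would stop matching once the hint is changed
or removed, which is the breakage the back-patching concern is about.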



Re: Better HINT message for "unexpected data beyond EOF"

From
Robert Haas
Date:
On Tue, Apr 1, 2025 at 9:54 AM Christoph Berg <myon@debian.org> wrote:
> Re: Robert Haas
> > > Another question is should we back-patch this? I believe we should (?)
> > I don't think this qualifies as a bug. The current wording isn't
> > factually wrong, just unhelpful. Even if it were wrong, we need a
> > pretty good reason to change message strings in a stable branch,
> > because that can break things for users who are grepping for the
> > current string (or a translation thereof). If an overwhelming
> > consensus in favor of back-patching emerges, fine, but my gut feeling
> > is that back-patching will make more people sad than it makes happy.
>
> It's only the HINT part. If I were to grep/search for the message, I
> would definitely use the message part.

I'm sure you would, but you're very smart.

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: Better HINT message for "unexpected data beyond EOF"

From
Andres Freund
Date:
Hi,

On 2025-04-01 09:49:12 -0400, Robert Haas wrote:
> On Tue, Apr 1, 2025 at 7:13 AM Jakub Wartak
> <jakub.wartak@enterprisedb.com> wrote:
> > Thread bump. So we have the following candidates:
> >
> > 1. remove it as Andres stated:
> > ERROR:  unexpected data beyond EOF in block 1472 of relation base/5/16387
> >
> > 2a. Robert's idea
> > ERROR:  unexpected data beyond EOF in block 1472 of relation base/5/16387
> > HINT:  This has been observed with PostgreSQL files being overwritten.
> >
> > 2b. Christoph's idea
> > ERROR:  unexpected data beyond EOF in block 1472 of relation base/5/16387
> > HINT:  Did anything besides PostgreSQL touch that file?

FWIW, I think these are all just about equally wrong.
1) doesn't allow the user to understand what could be the culprit
2*) omit that zero_damaged_pages can cause this due to the logic in mdreadv()


> > Another question is should we back-patch this? I believe we should (?)
> 
> I don't think this qualifies as a bug. The current wording isn't
> factually wrong, just unhelpful. Even if it were wrong, we need a
> pretty good reason to change message strings in a stable branch,
> because that can break things for users who are grepping for the
> current string (or a translation thereof). If an overwhelming
> consensus in favor of back-patching emerges, fine, but my gut feeling
> is that back-patching will make more people sad than it makes happy.

I'd certainly not backpatch.

Greetings,

Andres Freund



Re: Better HINT message for "unexpected data beyond EOF"

From
Jakub Wartak
Date:
On Tue, Apr 1, 2025 at 3:59 PM Andres Freund <andres@anarazel.de> wrote:

Hi Robert, Andres, Christoph,

> On 2025-04-01 09:49:12 -0400, Robert Haas wrote:
> > On Tue, Apr 1, 2025 at 7:13 AM Jakub Wartak
> > <jakub.wartak@enterprisedb.com> wrote:
> > > Thread bump. So we have the following candidates:
> > >
> > > 1. remove it as Andres stated:
> > > ERROR:  unexpected data beyond EOF in block 1472 of relation base/5/16387
> > >
> > > 2a. Robert's idea
> > > ERROR:  unexpected data beyond EOF in block 1472 of relation base/5/16387
> > > HINT:  This has been observed with PostgreSQL files being overwritten.
> > >
> > > 2b. Christoph's idea
> > > ERROR:  unexpected data beyond EOF in block 1472 of relation base/5/16387
> > > HINT:  Did anything besides PostgreSQL touch that file?
>
> FWIW, I think these are all just about equally wrong.
> 1) doesn't allow the user to understand what could be the culprit

Well, that's pretty easy: tablespace relations were overwritten live
(PITR on the same host, w/o tablespace remapping). This assumes you
know that this restore is happening in the first place.

> 2*) omit that zero_damaged_pages can cause this due to the logic in mdreadv()

Saw 00066aa173 [1], but zero_damaged_pages use is non-existent
(outside of handling corruption cases), right?

> > > Another question is should we back-patch this? I believe we should (?)
> >
> > I don't think this qualifies as a bug. The current wording isn't
> > factually wrong, just unhelpful.

I think it is highly misleading and out of date, although it
certainly had value in the past. I cannot comment from others'
perspective, but in the past it has sent me into literally
cross-checking whether Linux's lseek() system call vector had been
replaced by some LKMs (in some cases it was...).

So yes I agree, it would be *better* if it wasn't present in the first
place in modern times.

> > Even if it were wrong, we need a
> > pretty good reason to change message strings in a stable branch,
> > because that can break things for users who are grepping for the
> > current string (or a translation thereof). If an overwhelming
> > consensus in favor of back-patching emerges, fine, but my gut feeling
> > is that back-patching will make more people sad than it makes happy.
>
> I'd certainly not backpatch.

There goes my plan... I would recommend backpatching, but I'm alone
and outvoted by more experienced people :^)

OK, so attached is a small patch to eradicate this HINT: CI tested,
verified using original reproducer, registered in cf app.

-J.

[1] - https://github.com/postgres/postgres/commit/00066aa1733d84109f7569a7202c3915d8289d3a

Attachment