Thread: Better HINT message for "unexpected data beyond EOF"
I would like to propose that we tweak the following error message:

ERROR: unexpected data beyond EOF in block 1472 of relation base/5/16387
HINT: This has been seen to occur with buggy kernels; consider updating your system.

to something more generic and less confusing. It is coming from ffae5cc5a602 (2006), and we are probably not running those "buggy" kernels anywhere. I've seen this error multiple times, but it is usually due to some external influence overwriting/replacing the files in PGDATA and some (potentially new) backends open()ing those "new" files and finding unexpected file layout. In the real world this means usually:

a. files being potentially accidentally replaced/overwritten, please see attached file for reproducer
b. some obscure bugs (e.g. in EPAS - PG fork - we have on-demand automatic partition creation and we had bug/race conditions where multiple backends end up writing to the same relfilenode oid file)

so how about:

-HINT: This has been seen to occur with buggy kernels; consider updating your system.
+HINT: This has been observed with files being overwritten, buggy kernels and potentially other external file system influence.

?

-J.
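The overwrite scenario in (a) can be illustrated with a toy model. This is a minimal sketch and not PostgreSQL's code or the attached reproducer: the page cache is a plain dict, and `extend_relation`, `nblocks` and `demo` are hypothetical names invented for the illustration. It assumes the error fires when the server, about to extend a relation at the block number implied by the on-disk file size, finds that it already holds a non-empty page for that block.

```python
import os
import tempfile

BLOCK_SIZE = 8192  # PostgreSQL's default block size


class UnexpectedDataBeyondEOF(Exception):
    """Toy stand-in for the server's ERROR (hypothetical name)."""


def nblocks(path):
    # Size of the relation in blocks, derived from the file length on disk.
    return os.path.getsize(path) // BLOCK_SIZE


def extend_relation(path, cache):
    """Append one zeroed block; cache maps block number -> page bytes."""
    new_block = nblocks(path)
    cached = cache.get(new_block)
    # The block we are about to create should not already hold data.
    if cached is not None and any(cached):
        raise UnexpectedDataBeyondEOF(
            f"unexpected data beyond EOF in block {new_block} of relation {path}"
        )
    page = bytes(BLOCK_SIZE)
    with open(path, "ab") as f:
        f.write(page)
    cache[new_block] = page
    return new_block


def demo():
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "16387")
        # The "server" wrote three blocks and still caches all of them.
        with open(path, "wb") as f:
            f.write(b"\x01" * BLOCK_SIZE * 3)
        cache = {n: b"\x01" * BLOCK_SIZE for n in range(3)}

        # An external process replaces the file with a shorter copy,
        # for example a restore running against the live data directory.
        restored = path + ".restore"
        with open(restored, "wb") as f:
            f.write(b"\x02" * BLOCK_SIZE * 2)
        os.replace(restored, path)

        try:
            extend_relation(path, cache)
        except UnexpectedDataBeyondEOF as exc:
            return str(exc)
        return None


if __name__ == "__main__":
    print(demo())
```

In this model the kernel is behaving correctly. The disagreement between the file length and the cached pages comes entirely from the file being swapped out underneath the process that cached it.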
Attachment
On Wed, Mar 26, 2025 at 4:59 AM Jakub Wartak <jakub.wartak@enterprisedb.com> wrote:
> ERROR: unexpected data beyond EOF in block 1472 of relation base/5/16387
> HINT: This has been seen to occur with buggy kernels; consider
> updating your system.
>
> to something more generic and less confusing. It is coming from
> ffae5cc5a602 (2006), and we are probably not running those "buggy"
> kernels anywhere. I've seen this error multiple times, but it is
> usually due to some external influence overwriting/replacing the files
> in PGDATA and some (potentially new) backends open()ing those "new"
> files and finding unexpected file layout. In the real world this means
> usually:
> a. files being potentially accidentally replaced/overwritten, please
> see attached file for reproducer
> b. some obscure bugs (e.g. in EPAS - PG fork - we have on-demand
> automatic partition creation and we had bug/race conditions where
> multiple backends end up writing to the same relfilenode oid file)
>
> so how about:
> -HINT: This has been seen to occur with buggy kernels; consider
> updating your system.
> +HINT: This has been observed with files being overwritten, buggy
> kernels and potentially other external file system influence.

I agree that we should emphasize the possibility of files being overwritten. I'm not sure we should even mention buggy kernels -- is there any evidence that's still a thing on still-running hardware? I don't really like "other external file system influence" because that sounds like vague weasel-wording.

--
Robert Haas
EDB: http://www.enterprisedb.com
On Wed, Mar 26, 2025 at 4:01 PM Robert Haas <robertmhaas@gmail.com> wrote:
[..]
> > so how about:
> > -HINT: This has been seen to occur with buggy kernels; consider
> > updating your system.
> > +HINT: This has been observed with files being overwritten, buggy
> > kernels and potentially other external file system influence.
>
> I agree that we should emphasize the possibility of files being
> overwritten.
>
> I'm not sure we should even mention buggy kernels -- is
> there any evidence that's still a thing on still-running hardware?

No, I do not have any, other than comments in source code from Tom.

> I don't really like "other external file system influence" because that
> sounds like vague weasel-wording.

That was somewhat intended, because I did not want to rule out any external factor(s) and wanted to state it as vaguely as possible to stay generic. From the perspective of the core server itself it is literally "paranormal" / "rogue" activity (another entity opening and overwriting data files), but I suppose bugs or in some cases fs corruption could cause it too?

E.g. I've tracked down that Pavan fixed something in the 2ndQ fast_redo/pg_xlog_prefetch extension in 2016, where a concurrency bug in that extension was causing a similar problem back then on at least one occasion:

```...issue was caused because the prefetch worker process reading back blocks that are being concurrently dropped by the startup process (as a result of truncate operation). When the startup process later tries to extend the relation, it finds an existing valid block in the shared buffers and panics. ```

(sounds like it is related to data beyond EOF).

Proposals:
1. HINT: This has been observed with files being overwritten.
2. HINT: This has been observed with files being overwritten, old (2.6.x) buggy Linux kernels.
3. HINT: This has been observed with files being overwritten, old (2.6.x) buggy Linux kernels, corruption or other non-core PostgreSQL bugs.
4. HINT: This has been observed with files being overwritten, buggy kernels and potentially other external file system influence.

TBH, anything that simply avoids blaming kernel folks directly is better, but as a non-native speaker I'm finding it a little hard to articulate.

-J.
Hi,

On 2025-03-27 10:25:50 +0100, Jakub Wartak wrote:
> On Wed, Mar 26, 2025 at 4:01 PM Robert Haas <robertmhaas@gmail.com> wrote:
> [..]
> > > so how about:
> > > -HINT: This has been seen to occur with buggy kernels; consider
> > > updating your system.
> > > +HINT: This has been observed with files being overwritten, buggy
> > > kernels and potentially other external file system influence.
> >
> > I agree that we should emphasize the possibility of files being
> > overwritten.
> >
> > I'm not sure we should even mention buggy kernels -- is
> > there any evidence that's still a thing on still-running hardware?
>
> No, I do not have any, other than comments in source code from Tom.

FWIW, I'm not sure how much that was ever true. We certainly had our own bugs that could lead to the error occurring.

> E.g. I've tracked down that e.g. Pavan fixed something in 2ndQ
> fast_redo/pg_xlog_prefetch extension in 2016, where some concurrency
> bug in that extension was causing similar problem back then on at
> least one occasion: ```...issue was caused because the prefetch worker
> process reading back blocks that are being concurrently dropped by the
> startup process (as a result of truncate operation). When the startup
> process later tries to extend the relation, it finds an existing valid
> block in the shared buffers and panics. ``` (sounds like it is related
> with data beyond EOF).

FWIW that's more generally broken than just this error. You can't just read in data without holding a lock on a relation, that will cause breakage in all kinds of ways.

> Proposals:
> 1. HINT: This has been observed with files being overwritten.
> 2. HINT: This has been observed with files being overwritten, old
> (2.6.x) buggy Linux kernels .
> 3. HINT: This has been observed with files being overwritten, old
> (2.6.x) buggy Linux kernels, corruption or other non-core PostgreSQL
> bugs.
> 4. HINT: This has been observed with files being overwritten, buggy
> kernels and potentially other external file system influence.

FWIW, I think we should just drop the HINT. We really have no clue what caused it and a HINT should imo have at least some value other than "*Shrug*", which is imo pretty much what these HINTs amount to, if they were a bit more blunt.

Greetings,

Andres Freund
On Thu, Mar 27, 2025 at 10:12 AM Andres Freund <andres@anarazel.de> wrote:
> FWIW, I think we should just drop the HINT. We really have no clue what caused
> it and a HINT should imo have at least some value other than "*Shrug*", which
> is imo pretty much what these HINTs amount to, if they were a bit more blunt.

I think that would be better than what we have now, but I still wonder if we should give some kind of a hint that an external process may be doing something to that file. Jakub and I may be biased by having just seen a case of exactly that in the field, but I wonder now how many 'data beyond EOF' messages are exactly that -- and it's not like the user is going to guess that 'data beyond EOF' might mean that such a thing occurred.

--
Robert Haas
EDB: http://www.enterprisedb.com
Re: Robert Haas
> I think that would be better than what we have now, but I still wonder
> if we should give some kind of a hint that an external process may be
> doing something to that file. Jakub and I may be biased by having just
> seen a case of exactly that in the field, but I wonder now how many
> 'data beyond EOF' messages are exactly that -- and it's not like the
> user is going to guess that 'data beyond EOF' might mean that such a
> thing occurred.

HINT: Did anything besides PostgreSQL touch that file?

Christoph
On Thu, Mar 27, 2025 at 4:00 PM Christoph Berg <myon@debian.org> wrote:
>
> Re: Robert Haas
> > I think that would be better than what we have now, but I still wonder
> > if we should give some kind of a hint that an external process may be
> > doing something to that file. Jakub and I may be biased by having just
> > seen a case of exactly that in the field, but I wonder now how many
> > 'data beyond EOF' messages are exactly that -- and it's not like the
> > user is going to guess that 'data beyond EOF' might mean that such a
> > thing occurred.
>
> HINT: Did anything besides PostgreSQL touch that file?

Thread bump. So we have the following candidates:

1. remove it as Andres stated:
ERROR: unexpected data beyond EOF in block 1472 of relation base/5/16387

2a. Robert's idea
ERROR: unexpected data beyond EOF in block 1472 of relation base/5/16387
HINT: This has been observed with PostgreSQL files being overwritten.

2b. Christoph's idea
ERROR: unexpected data beyond EOF in block 1472 of relation base/5/16387
HINT: Did anything besides PostgreSQL touch that file?

Anything else? #1 has the advantage that we don't need to provide 11 translations inside src/backend/po/*.po (I could use google translate when proposing the patch, but I do not take any responsibility for what it generates ;))

Another question is whether we should back-patch this. I believe we should (?)

-J.
On Tue, Apr 1, 2025 at 7:13 AM Jakub Wartak <jakub.wartak@enterprisedb.com> wrote:
> Thread bump. So we have the following candidates:
>
> 1. remove it as Andres stated:
> ERROR: unexpected data beyond EOF in block 1472 of relation base/5/16387
>
> 2a. Robert's idea
> ERROR: unexpected data beyond EOF in block 1472 of relation base/5/16387
> HINT: This has been observed with PostgreSQL files being overwritten.
>
> 2b. Christoph's idea
> ERROR: unexpected data beyond EOF in block 1472 of relation base/5/16387
> HINT: Did anything besides PostgreSQL touch that file?

I don't think I proposed that exact phrasing - I prefer (2b) over (2a), although I would replace "besides" with "other than".

> Another question is should we back-patch this? I believe we should (?)

I don't think this qualifies as a bug. The current wording isn't factually wrong, just unhelpful. Even if it were wrong, we need a pretty good reason to change message strings in a stable branch, because that can break things for users who are grepping for the current string (or a translation thereof). If an overwhelming consensus in favor of back-patching emerges, fine, but my gut feeling is that back-patching will make more people sad than it makes happy.

--
Robert Haas
EDB: http://www.enterprisedb.com
Re: Robert Haas
> > Another question is should we back-patch this? I believe we should (?)
>
> I don't think this qualifies as a bug. The current wording isn't
> factually wrong, just unhelpful. Even if it were wrong, we need a
> pretty good reason to change message strings in a stable branch,
> because that can break things for users who are grepping for the
> current string (or a translation thereof). If an overwhelming
> consensus in favor of back-patching emerges, fine, but my gut feeling
> is that back-patching will make more people sad than it makes happy.

It's only the HINT part. If I were to grep/search for the message, I would definitely use the message part.

Christoph
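To make the grep argument concrete, here is a minimal sketch of a log scanner that keys on the primary message only. The helper name `find_beyond_eof` and the sample log lists are made up for illustration; the message text is the one quoted in this thread. It assumes untranslated (English) server messages, so it does not address the "translation thereof" part of the concern.

```python
import re

# Match only the primary ERROR line; the HINT line is deliberately ignored.
PRIMARY = re.compile(
    r"ERROR:\s+unexpected data beyond EOF in block (\d+) of relation (\S+)"
)


def find_beyond_eof(log_lines):
    """Return (block, relation path) for every matching ERROR line."""
    hits = []
    for line in log_lines:
        m = PRIMARY.search(line)
        if m:
            hits.append((int(m.group(1)), m.group(2)))
    return hits


# Log excerpt with the current HINT.
OLD_LOG = [
    "ERROR:  unexpected data beyond EOF in block 1472 of relation base/5/16387",
    "HINT:  This has been seen to occur with buggy kernels; "
    "consider updating your system.",
]

# Log excerpt with the HINT removed (candidate 1 in the thread).
NEW_LOG = [
    "ERROR:  unexpected data beyond EOF in block 1472 of relation base/5/16387",
]

if __name__ == "__main__":
    print(find_beyond_eof(OLD_LOG))
    print(find_beyond_eof(NEW_LOG))
```

A scanner written this way gives the same result whether the HINT is present, reworded or removed. A scanner that matches on the HINT text would stop matching after the change, which is the risk raised against back-patching.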
On Tue, Apr 1, 2025 at 9:54 AM Christoph Berg <myon@debian.org> wrote:
> Re: Robert Haas
> > > Another question is should we back-patch this? I believe we should (?)
> >
> > I don't think this qualifies as a bug. The current wording isn't
> > factually wrong, just unhelpful. Even if it were wrong, we need a
> > pretty good reason to change message strings in a stable branch,
> > because that can break things for users who are grepping for the
> > current string (or a translation thereof). If an overwhelming
> > consensus in favor of back-patching emerges, fine, but my gut feeling
> > is that back-patching will make more people sad than it makes happy.
>
> It's only the HINT part. If I were to grep/search for the message, I
> would definitely use the message part.

I'm sure you would, but you're very smart.

--
Robert Haas
EDB: http://www.enterprisedb.com
Hi,

On 2025-04-01 09:49:12 -0400, Robert Haas wrote:
> On Tue, Apr 1, 2025 at 7:13 AM Jakub Wartak
> <jakub.wartak@enterprisedb.com> wrote:
> > Thread bump. So we have the following candidates:
> >
> > 1. remove it as Andres stated:
> > ERROR: unexpected data beyond EOF in block 1472 of relation base/5/16387
> >
> > 2a. Robert's idea
> > ERROR: unexpected data beyond EOF in block 1472 of relation base/5/16387
> > HINT: This has been observed with PostgreSQL files being overwritten.
> >
> > 2b. Christoph's idea
> > ERROR: unexpected data beyond EOF in block 1472 of relation base/5/16387
> > HINT: Did anything besides PostgreSQL touch that file?

FWIW, I think these are all just about equally wrong.

1) doesn't allow the user to understand what could be the culprit
2*) omit that zero_damaged_pages can cause this due to the logic in mdreadv()

> > Another question is should we back-patch this? I believe we should (?)
>
> I don't think this qualifies as a bug. The current wording isn't
> factually wrong, just unhelpful. Even if it were wrong, we need a
> pretty good reason to change message strings in a stable branch,
> because that can break things for users who are grepping for the
> current string (or a translation thereof). If an overwhelming
> consensus in favor of back-patching emerges, fine, but my gut feeling
> is that back-patching will make more people sad than it makes happy.

I'd certainly not backpatch.

Greetings,

Andres Freund
On Tue, Apr 1, 2025 at 3:59 PM Andres Freund <andres@anarazel.de> wrote:

Hi Robert, Andres, Christoph,

> On 2025-04-01 09:49:12 -0400, Robert Haas wrote:
> > On Tue, Apr 1, 2025 at 7:13 AM Jakub Wartak
> > <jakub.wartak@enterprisedb.com> wrote:
> > > Thread bump. So we have the following candidates:
> > >
> > > 1. remove it as Andres stated:
> > > ERROR: unexpected data beyond EOF in block 1472 of relation base/5/16387
> > >
> > > 2a. Robert's idea
> > > ERROR: unexpected data beyond EOF in block 1472 of relation base/5/16387
> > > HINT: This has been observed with PostgreSQL files being overwritten.
> > >
> > > 2b. Christoph's idea
> > > ERROR: unexpected data beyond EOF in block 1472 of relation base/5/16387
> > > HINT: Did anything besides PostgreSQL touch that file?
>
> FWIW, I think these are all just about equally wrong.
> 1) doesn't allow the user to understand what could be the culprit

Well, that's pretty easy: tablespace relations were overwritten live (PITR on the same host, w/o tablespace remapping). This assumes you know that this restore is happening in the first place.

> 2*) omit that zero_damaged_pages can cause this due to the logic in mdreadv()

Saw 00066aa173 [1], but zero_damaged_pages use is non-existent (outside of handling corruption cases), right?

> > > Another question is should we back-patch this? I believe we should (?)
> >
> > I don't think this qualifies as a bug. The current wording isn't
> > factually wrong, just unhelpful.

I think it is highly misleading and outdated, although it certainly had value in the past. I cannot comment from others' perspective, but it has sent me in the past into literally cross-checking whether Linux's lseek() system call vector had been replaced by some LKMs (in some cases it was...). So yes I agree, it would be *better* if it wasn't present in the first place in modern times.

> > Even if it were wrong, we need a
> > pretty good reason to change message strings in a stable branch,
> > because that can break things for users who are grepping for the
> > current string (or a translation thereof). If an overwhelming
> > consensus in favor of back-patching emerges, fine, but my gut feeling
> > is that back-patching will make more people sad than it makes happy.
>
> I'd certainly not backpatch.

There goes my plan... I would recommend backpatching, but I'm alone and outvoted by more experienced people :^)

OK, so attached is a small patch to eradicate this HINT: CI tested, verified using original reproducer, registered in cf app.

-J.

[1] - https://github.com/postgres/postgres/commit/00066aa1733d84109f7569a7202c3915d8289d3a