Thread: Re: FSM doesn't recover after zeroing damaged page.

Re: FSM doesn't recover after zeroing damaged page.

From
vignesh C
Date:
On Fri, 7 Feb 2025 at 05:45, Anton A. Melnikov
<a.melnikov@postgrespro.ru> wrote:
>
> Here is a small patch that does it and eliminates multiple warnings.
> Would be glad if you take a look on it.

I noticed that Kirill's comments from [1] have not been addressed yet,
so I have changed the status of the commitfest entry to Waiting on Author.
Kindly address them and update the status to Needs review.
[1] - https://www.postgresql.org/message-id/CALdSSPjg0hBUvPASkVt799z7%2BJOYdcE0j_WznbxjVB0H05M%2Bqw%40mail.gmail.com

Regards,
Vignesh



Re: FSM doesn't recover after zeroing damaged page.

From
"Anton A. Melnikov"
Date:
Hi!

On 16.03.2025 16:03, vignesh C wrote:
> I noticed that Kirill's comments from [1] have not been addressed yet,
> so I have changed the status of the commitfest entry to Waiting on Author.
> Kindly address them and update the status to Needs review.
> [1] - https://www.postgresql.org/message-id/CALdSSPjg0hBUvPASkVt799z7%2BJOYdcE0j_WznbxjVB0H05M%2Bqw%40mail.gmail.com

Thanks for reminding me!

On 10.03.2025 16:56, Kirill Reshke wrote:
>> since as i suppose the corrupted page must be rewritten certainly, not for hint.
> 
> 
> Could you please elaborate? FSM changes are never wal-logged, so it is
> possible to read torn pages from a disk (their checksum will mismatch
> with the header), so
> MarkBufferDirtyHint seems to be completely fine here. I don't think
> MarkBufferDirty provides something different from MarkBufferDirtyHint
> in the FSM case (because FSM changes are not persistent).

Sorry for the long delay in replying.
The problem turned out to be not as simple as I first thought.

If we break down the reproduction of the issue from the first email [1]
from the perspective of a database user, the following occurs:

1) The user sees a message saying that a page is corrupted and has been zeroed out.
This means the torn page issue is resolved.

> At the end
> of the day, you just should write a page on a disk sooner or later,
> and that's it.

2) This is exactly what happens after a checkpoint is executed.
Now the user is confident that the page is in a normal state on disk.

3) However, after a server restart, the user sees that the same page is corrupted again.
This means the page was not saved to disk, and the cycle repeats indefinitely.
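
The three steps above can be pictured with a small toy model. It is a
sketch under loud assumptions, not PostgreSQL code: `ToyServer`,
`page_is_valid`, `TORN_PAGE` and the checksum rule are all hypothetical,
and the model simply assumes that a zeroed buffer which is not flagged
dirty is skipped by the checkpoint. It only shows why the warning can
repeat after every restart when the zeroed page never reaches disk; it
says nothing about the conditions under which the real FSM behaves this way.

```python
# Toy model only: none of these names exist in PostgreSQL.
from dataclasses import dataclass

PAGE_SIZE = 16  # tiny hypothetical page size, in bytes

# A "torn" page: not all zeros, and its last byte does not match the toy checksum.
TORN_PAGE = bytes([1] * (PAGE_SIZE - 1) + [0])


def page_is_valid(page: bytes) -> bool:
    """Hypothetical check: all-zero pages are valid, otherwise the last
    byte must equal the sum of the other bytes modulo 256."""
    if page == bytes(PAGE_SIZE):
        return True
    return sum(page[:-1]) % 256 == page[-1]


@dataclass
class Buffer:
    data: bytes
    dirty: bool = False


class ToyServer:
    def __init__(self, disk):
        self.disk = disk      # block number -> bytes, survives restarts
        self.cache = {}       # block number -> Buffer, lost on restart
        self.warnings = []    # block numbers reported as corrupted

    def read_page(self, blkno, mark_dirty_after_zeroing):
        """Load a page; zero it in memory if the toy check fails."""
        if blkno not in self.cache:
            buf = Buffer(self.disk[blkno])
            if not page_is_valid(buf.data):
                self.warnings.append(blkno)
                buf.data = bytes(PAGE_SIZE)
                buf.dirty = mark_dirty_after_zeroing
            self.cache[blkno] = buf
        return self.cache[blkno].data

    def checkpoint(self):
        """Write out only the buffers that are flagged dirty."""
        for blkno, buf in self.cache.items():
            if buf.dirty:
                self.disk[blkno] = buf.data
                buf.dirty = False

    def restart(self):
        """Drop the in-memory cache; the disk contents stay as they are."""
        self.cache.clear()


def warnings_across_restarts(mark_dirty_after_zeroing, restarts=3):
    """Count corruption warnings over several read/checkpoint/restart cycles."""
    server = ToyServer({0: TORN_PAGE})
    for _ in range(restarts):
        server.read_page(0, mark_dirty_after_zeroing)
        server.checkpoint()
        server.restart()
    return len(server.warnings)


if __name__ == "__main__":
    print(warnings_across_restarts(mark_dirty_after_zeroing=False))
    print(warnings_across_restarts(mark_dirty_after_zeroing=True))
```

In this model the warning is raised on each of the three restarts when
the zeroed buffer is left clean, and only once when it is flagged dirty.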

IMO, this is annoying and looks very much like a bug.

That said, the fix initially proposed seems incorrect and overly crude to me,
as this behavior does not occur with every FSM page but only under specific conditions.
E.g., the error will not recur if it was the last incomplete FSM page.
I think it is first necessary to understand the reasons for this difference in behavior.
So I plan to dig deeper into the FSM algorithm and come up with a more targeted fix.

Best wishes,

-- 
Anton A. Melnikov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

[1] https://www.postgresql.org/message-id/a61efc0b-9cfc-4f24-ac5d-ea6600d9ccbf%40postgrespro.ru