Re: FSM corruption leading to errors - Mailing list pgsql-hackers

From Pavan Deolasee
Subject Re: FSM corruption leading to errors
Date
Msg-id CABOikdM5rw=25qQc+wZoYN5yym2r09Q9X0Ria4_P48CGeCRU_g@mail.gmail.com
In response to Re: FSM corruption leading to errors  (Michael Paquier <michael.paquier@gmail.com>)
Responses Re: FSM corruption leading to errors  (Michael Paquier <michael.paquier@gmail.com>)
List pgsql-hackers


On Mon, Oct 10, 2016 at 7:55 PM, Michael Paquier <michael.paquier@gmail.com> wrote:


+   /*
+    * See comments in GetPageWithFreeSpace about handling outside the valid
+    * range blocks
+    */
+   nblocks = RelationGetNumberOfBlocks(rel);
+   while (target_block >= nblocks && target_block != InvalidBlockNumber)
+   {
+       target_block = RecordAndGetPageWithFreeSpace(rel, target_block, 0,
+               spaceNeeded);
+   }
Hm. This is just a workaround. Even if things are done this way the
FSM will remain corrupted.

No, because the code above updates the FSM entries of those out-of-range blocks. But now that I look at it again, maybe this is not correct: it may get into an endless loop if the relation is repeatedly extended concurrently.
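To make the mechanics of that loop concrete, here is a minimal, self-contained sketch in plain C. It is only a toy model, not PostgreSQL code: the `ToyFsm` type, the `toy_*` functions, the fixed slot count, and the first-fit search are all assumptions made for illustration (the real FSM is a tree of pages and searches differently). It shows how each iteration zeroes one stale out-of-range entry and asks for another candidate. The toy keeps `nblocks` fixed, so it does not model the concurrent-extension case mentioned above.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_INVALID_BLOCK UINT32_MAX  /* stand-in for InvalidBlockNumber */
#define TOY_FSM_SLOTS     16          /* hypothetical fixed-size map */

/* Toy FSM: free bytes advertised for each block slot (assumption, not the real layout). */
typedef struct
{
    uint32_t avail[TOY_FSM_SLOTS];
} ToyFsm;

/* First-fit search: lowest block advertising at least `needed` bytes, or TOY_INVALID_BLOCK. */
static uint32_t
toy_search(const ToyFsm *fsm, uint32_t needed)
{
    for (uint32_t blk = 0; blk < TOY_FSM_SLOTS; blk++)
        if (fsm->avail[blk] >= needed)
            return blk;
    return TOY_INVALID_BLOCK;
}

/* Loosely mimics RecordAndGetPageWithFreeSpace: record a new value, then search again. */
static uint32_t
toy_record_and_get(ToyFsm *fsm, uint32_t old_blk, uint32_t old_avail, uint32_t needed)
{
    assert(old_blk < TOY_FSM_SLOTS);
    fsm->avail[old_blk] = old_avail;
    return toy_search(fsm, needed);
}

/*
 * Same shape as the loop in the patch: while the candidate lies beyond the
 * end of the relation, record zero free space for it and fetch another one.
 * `repairs` counts how many stale entries were zeroed along the way.
 */
static uint32_t
toy_get_valid_block(ToyFsm *fsm, uint32_t nblocks, uint32_t needed, int *repairs)
{
    uint32_t target = toy_search(fsm, needed);

    *repairs = 0;
    while (target >= nblocks && target != TOY_INVALID_BLOCK)
    {
        target = toy_record_and_get(fsm, target, 0, needed);
        (*repairs)++;
    }
    return target;
}
```

In this model, a 4-block relation whose map still advertises space in slots 6 and 9 has both stale entries zeroed by a single call to `toy_get_valid_block`, after which the search reports that no block fits. With a fixed `nblocks` the loop ends after at most one pass per stale entry.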
 
And isn't that going to break once the
relation is extended again?

Once the underlying bug is fixed, I don't see why it should break again. I added the above code mostly to deal with already-corrupt FSMs. Maybe we can just document the problem and leave it to the user to run some correctness checks (see below), especially given that the code is not cheap and adds overhead for everybody, irrespective of whether they have or will ever have a corrupt FSM.
 
I'd suggest instead putting in the release
notes a query that lets one find which relations are broken
and fix them directly. That's annoying, but it would be really
better than a workaround. One idea here is to use pg_freespace() and
see if it returns a non-zero value for an out-of-range block on a
standby.


Right, that's how I tested for broken FSMs. A challenge with any such query is that if the shared-buffer copy of the FSM page is intact, the query won't report the problematic FSMs. Of course, if the fix is applied to the standby and the standby is restarted, then corrupt FSMs can be detected.
 

At the same time, I have translated your script into a TAP test; I
found that more useful when testing.

Thanks for doing that.

Thanks,
Pavan

--
 Pavan Deolasee                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
