Thread: FSM corruption leading to errors

FSM corruption leading to errors

From
Pavan Deolasee
Date:
I investigated a bug report from one of our customers and it looked very similar to previous bug reports here [1], [2], [3] (and probably more). In these reports, the error looks something like this:

ERROR:  could not read block 28991 in file "base/16390/572026": read only 0 of 8192 bytes

I traced it to the following code in MarkBufferDirtyHint(). The function returns without setting the DIRTY bit on the standby:

            /*
             * If we're in recovery we cannot dirty a page because of a hint.
             * We can set the hint, just not dirty the page as a result so the
             * hint is lost when we evict the page or shutdown.
             *
             * See src/backend/storage/page/README for longer discussion.
             */
            if (RecoveryInProgress())
                return;

freespace.c freely uses MarkBufferDirtyHint() whenever changes are made to the FSM. I think that's usually alright because FSM changes are not WAL-logged, and if the FSM ever returns a block with less free space than the caller needs, the caller is usually prepared to update the FSM and request a new block. But if it returns a block that is outside the size of the relation, then we're in trouble: the very next ReadBuffer() fails to handle such a block and throws the error.

When a relation is truncated, the FSM is truncated too to remove references to the heap blocks that are being truncated. But since the FSM buffer may not be marked DIRTY on the standby, if the buffer gets evicted from the buffer cache, the on-disk copy of the FSM page may be left with references to the truncated heap pages. When the standby is later promoted to be the master, and an insert/update is attempted to the table, the FSM may return a block that is outside the valid range of the relation. That results in the said error.

Once this was clear, it was easy to put together a fully reproducible test case. See the attached script; you'll need to adjust it to your environment. This affects all releases starting with 9.3, and the script can reproduce the problem on all of them.

I believe the fix is very simple. The FSM change during truncation is critical, and the buffer must be marked with MarkBufferDirty(), i.e. those changes must make it to disk. I think it's alright not to WAL-log them, because XLOG_SMGR_TRUNCATE will redo() them if a crash occurs. But they must not be lost across a checkpoint. Also, since this happens only during relation truncation, I don't see any problem from a performance perspective.
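
To illustrate, the change I have in mind in FreeSpaceMapTruncateRel() is roughly the following (a sketch of the shape of the change against freespace.c, not the actual patch):

    LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
    fsm_truncate_avail(BufferGetPage(buf), first_removed_slot);

    /*
     * A plain MarkBufferDirtyHint() is a no-op during recovery, so an
     * evicted FSM page on a standby can keep stale entries pointing
     * beyond the new end of the relation.  Dirty the buffer for real so
     * the truncated FSM page reaches disk.
     */
    MarkBufferDirty(buf);           /* was MarkBufferDirtyHint(buf, false) */

    UnlockReleaseBuffer(buf);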

What bothers me is how to fix the problem for already-affected standbys. If the FSM for some table is already corrupted on the standby, users won't notice it until the standby is promoted to be the new master. If the standby suddenly starts throwing errors after failover, it will be a very bad situation for the users, as we saw with our customer. The remedy is simple (delete the FSM and VACUUM the table), but that doesn't sound nice, and users would not know to do it until they see the problem.

One idea is to always check whether the block returned by the FSM is outside the valid range and discard such blocks after updating the FSM (the attached patch does that). The problem with that approach is that RelationGetNumberOfBlocks() is not really cheap, and invoking it every time the FSM is consulted may not be a bright idea. Can we cache that value in RelationData or some such place (BulkInsertState?) and use it as a hint for quickly checking whether the block is (potentially) outside the range and discarding it? Any other ideas?

The other concern I have (and TBH that's what I initially suspected was the real problem, until I saw the RecoveryInProgress()-specific code) is: can this also affect stand-alone masters? The comments at MarkBufferDirtyHint() made me think so:

 * 3. This function does not guarantee that the buffer is always marked dirty
 *    (due to a race condition), so it cannot be used for important changes.

So I was working with a theory that updates to the FSM page are somehow lost because the race mentioned in the comment actually kicks in. But I'm not sure whether that race is only possible when the caller holds a SHARE lock on the buffer. When the FSM is truncated, the caller holds an EXCLUSIVE lock on the FSM buffer, so we're probably safe. I could not reproduce the issue on a stand-alone master, but it's probably worth checking.

It might also be a good idea to inspect the other callers of MarkBufferDirtyHint() and see if any of them are vulnerable, especially from a standby perspective. I did one round and couldn't spot another problem.

Thanks,
Pavan



--
 Pavan Deolasee                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
Attachment

Re: FSM corruption leading to errors

From
Anastasia Lubennikova
Date:
06.10.2016 20:59, Pavan Deolasee:
> I investigated a bug report from one of our customers and it looked very similar to previous bug reports here [1], [2], [3] (and probably more). In these reports, the error looks something like this:
>
> ERROR:  could not read block 28991 in file "base/16390/572026": read only 0 of 8192 bytes
>
> [...]

Could you please add the patches to commitfest?
I'm going to test them and write a review in a few days.

-- 
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Re: FSM corruption leading to errors

From
Michael Paquier
Date:
On Fri, Oct 7, 2016 at 2:59 AM, Pavan Deolasee <pavan.deolasee@gmail.com> wrote:
> I investigated a bug report from one of our customers and it looked very
> similar to previous bug reports here [1], [2], [3] (and probably more). In
> these reports, the error looks something like this:
>
> ERROR:  could not read block 28991 in file "base/16390/572026": read only 0
> of 8192 bytes
>
> I traced it to the following code in MarkBufferDirtyHint().
>
> freespace.c freely uses MarkBufferDirtyHint() whenever changes are made to
> the FSM. I think that's usually alright because FSM changes are not WAL
> logged and if FSM ever returns a block with less free space than the caller
> needs, the caller is usually prepared to update the FSM and request for a
> new block. But if it returns a block that is outside the size of the
> relation, then we're in trouble. The very next ReadBuffer() fails to handle
> such a block and throws the error.

To be honest, I have been seeing a very similar issue for a couple of
weeks now on some test beds, on a couple of relations involving a
promoted standby, with the error happening more frequently for
relations with a more aggressive autovacuum setup. I did not take the
time to look at it seriously, and I am very glad to see this email.

> When a relation is truncated, the FSM is truncated too to remove references
> to the heap blocks that are being truncated. But since the FSM buffer may
> not be marked DIRTY on the standby, if the buffer gets evicted from the
> buffer cache, the on-disk copy of the FSM page may be left with references
> to the truncated heap pages. When the standby is later promoted to be the
> master, and an insert/update is attempted to the table, the FSM may return a
> block that is outside the valid range of the relation. That results in the
> said error.

Oops.

> I believe the fix is very simple. The FSM change during truncation is
> critical and the buffer must be marked by MarkBufferDirty() i.e. those
> changes must make it to disk. I think it's alright not to WAL log them
> because XLOG_SMGR_TRUNCATE will redo() them if a crash occurs. But it must
> not be lost across a checkpoint. Also, since it happens only during relation
> truncation, I don't see any problem from performance perspective.

Agreed. I happen to notice that the VM is similarly careful when it comes
to truncation (visibilitymap_truncate).

> What bothers me is how to fix the problem for already affected standbys. If
> the FSM for some table is already corrupted at the standby, users won't
> notice it until the standby is promoted to be the new master. If the standby
> starts throwing errors suddenly after failover, it will be a very bad
> situation for the users, like we noticed with our customers. The fix is
> simple and users can just delete the FSM (and VACUUM the table), but that
> doesn't sound nice and they would not know until they see the problem.

+   /*
+    * See comments in GetPageWithFreeSpace about handling outside the valid
+    * range blocks
+    */
+   nblocks = RelationGetNumberOfBlocks(rel);
+   while (target_block >= nblocks && target_block != InvalidBlockNumber)
+   {
+       target_block = RecordAndGetPageWithFreeSpace(rel, target_block, 0,
+               spaceNeeded);
+   }
Hm. This is just a workaround. Even if things are done this way the
FSM will remain corrupted. And isn't that going to break once the
relation is extended again? I'd suggest instead putting in the release
notes a query that allows one to identify which relations are broken
and have them fixed directly. That's annoying, but it would be really
better than a workaround. One idea here is to use pg_freespace() and
see if it returns a non-zero value for an out-of-range block on a
standby.

I have also been thinking about the comment you added and we could
just do something like that:
+       /*
+        * Mark the buffer still holding references to truncated blocks as
+        * dirty to be sure that this makes it to disk and is kept consistent
+        * with the truncated relation.
+        */
+       MarkBufferDirty(buf)

> It might also be a good idea to inspect other callers of
> MarkBufferDirtyHint() and see if any of them is vulnerable, especially from
> standby perspective. I did one round, and couldn't see another problem.

Haven't looked at those yet, will do so tomorrow.

At the same time, I have translated your script into a TAP test; I
found that more useful when testing.
-- 
Michael



Re: FSM corruption leading to errors

From
Michael Paquier
Date:
On Mon, Oct 10, 2016 at 11:25 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:
> At the same time, I have translated your script into a TAP test, I
> found that more useful when testing..

Well... Here is the actual patch.
--
Michael

Attachment

Re: FSM corruption leading to errors

From
Michael Paquier
Date:
On Fri, Oct 7, 2016 at 11:50 PM, Anastasia Lubennikova
<a.lubennikova@postgrespro.ru> wrote:
> Could you please add the patches to commitfest?
> I'm going to test them and write a review in a few days.

Here you go:
https://commitfest.postgresql.org/11/817/
-- 
Michael



Re: FSM corruption leading to errors

From
Pavan Deolasee
Date:


On Mon, Oct 10, 2016 at 7:55 PM, Michael Paquier <michael.paquier@gmail.com> wrote:


+   /*
+    * See comments in GetPageWithFreeSpace about handling outside the valid
+    * range blocks
+    */
+   nblocks = RelationGetNumberOfBlocks(rel);
+   while (target_block >= nblocks && target_block != InvalidBlockNumber)
+   {
+       target_block = RecordAndGetPageWithFreeSpace(rel, target_block, 0,
+               spaceNeeded);
+   }
Hm. This is just a workaround. Even if things are done this way the
FSM will remain corrupted.

No, because the code above updates the FSM for those out-of-range blocks. But now that I look at it again, maybe this is not correct, and it may get into an endless loop if the relation is repeatedly extended concurrently.
 
And isn't that going to break once the
relation is extended again?

Once the underlying bug is fixed, I don't see why it should break again. I added the above code mostly to deal with already-corrupt FSMs. Maybe we can just document it and leave it to the user to run some correctness checks (see below), especially given that the code is not cheap and adds overhead for everybody, irrespective of whether they have, or will ever have, a corrupt FSM.
 
I'd suggest instead putting in the release
notes a query that allows one to analyze what are the relations broken
and directly have them fixed. That's annoying, but it would be really
better than a workaround. One idea here is to use pg_freespace() and
see if it returns a non-zero value for an out-of-range block on a
standby.


Right, that's how I tested for broken FSMs. A challenge with any such query is that if the shared-buffers copy of the FSM page is intact, the query won't report the problematic FSM. Of course, if the fix is applied to the standby and it is restarted, then corrupt FSMs can be detected.
 

At the same time, I have translated your script into a TAP test, I
found that more useful when testing..

Thanks for doing that.

Thanks,
Pavan

--
 Pavan Deolasee                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: FSM corruption leading to errors

From
Michael Paquier
Date:
On Mon, Oct 10, 2016 at 11:41 PM, Pavan Deolasee
<pavan.deolasee@gmail.com> wrote:
>
>
> On Mon, Oct 10, 2016 at 7:55 PM, Michael Paquier <michael.paquier@gmail.com>
> wrote:
>>
>>
>>
>> +   /*
>> +    * See comments in GetPageWithFreeSpace about handling outside the
>> valid
>> +    * range blocks
>> +    */
>> +   nblocks = RelationGetNumberOfBlocks(rel);
>> +   while (target_block >= nblocks && target_block != InvalidBlockNumber)
>> +   {
>> +       target_block = RecordAndGetPageWithFreeSpace(rel, target_block, 0,
>> +               spaceNeeded);
>> +   }
>> Hm. This is just a workaround. Even if things are done this way the
>> FSM will remain corrupted.
>
>
> No, because the code above updates the FSM of those out-of-the range blocks.
> But now that I look at it again, may be this is not correct and it may get
> into an endless loop if the relation is repeatedly extended concurrently.

Ah yes, that's what the call to
RecordAndGetPageWithFreeSpace()/fsm_set_and_search() is for. I missed
that yesterday before sleeping.

>> And isn't that going to break once the
>> relation is extended again?
>
>
> Once the underlying bug is fixed, I don't see why it should break again. I
> added the above code to mostly deal with already corrupt FSMs. May be we can
> just document and leave it to the user to run some correctness checks (see
> below), especially given that the code is not cheap and adds overheads for
> everybody, irrespective of whether they have or will ever have corrupt FSM.

Yep. I'd leave it for the release notes to hold a diagnostic method.
That's annoying, but this has been done in the past, e.g. for the
multixact issues.

>> I'd suggest instead putting in the release
>> notes a query that allows one to analyze what are the relations broken
>> and directly have them fixed. That's annoying, but it would be really
>> better than a workaround. One idea here is to use pg_freespace() and
>> see if it returns a non-zero value for an out-of-range block on a
>> standby.
>>
>
> Right, that's how I tested for broken FSMs. A challenge with any such query
> is that if the shared buffer copy of the FSM page is intact, then the query
> won't return problematic FSMs. Of course, if the fix is applied to the
> standby and is restarted, then corrupt FSMs can be detected.

What if you restart the standby, and then do a diagnostic query?
Wouldn't that be enough? (Something just based on
pg_freespace(pg_relation_size(oid) / block_size) != 0)
-- 
Michael



Re: FSM corruption leading to errors

From
Pavan Deolasee
Date:


On Tue, Oct 11, 2016 at 5:20 AM, Michael Paquier <michael.paquier@gmail.com> wrote:

>
> Once the underlying bug is fixed, I don't see why it should break again. I
> added the above code to mostly deal with already corrupt FSMs. May be we can
> just document and leave it to the user to run some correctness checks (see
> below), especially given that the code is not cheap and adds overheads for
> everybody, irrespective of whether they have or will ever have corrupt FSM.

Yep. I'd leave it for the release notes to hold a diagnostic method.
That's annoying, but this has been done in the past like for the
multixact issues..

I'm okay with that. It's annoying, especially because the bug may show up when your primary is down and you have just failed over for HA, only to find that the standby won't work correctly. But I don't have any ideas on how to fix existing corruption without adding a significant penalty to the normal path.
 

What if you restart the standby, and then do a diagnostic query?
Wouldn't that be enough? (Something just based on
pg_freespace(pg_relation_size(oid) / block_size) != 0)


Yes, that will be enough once the fix is in place.

I think this is a major bug and I would appreciate any ideas to get the patch in a committable shape before the next minor release goes out. We probably need a committer to get interested in this to make progress.

Thanks,
Pavan

--
 Pavan Deolasee                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: FSM corruption leading to errors

From
Michael Paquier
Date:
On Mon, Oct 17, 2016 at 4:14 PM, Pavan Deolasee
<pavan.deolasee@gmail.com> wrote:
> I think this is a major bug and I would appreciate any ideas to get the
> patch in a committable shape before the next minor release goes out. We
> probably need a committer to get interested in this to make progress.

Same opinion here; this is very annoying for some of my internal
users. I don't have more to offer though.
-- 
Michael



Re: FSM corruption leading to errors

From
Heikki Linnakangas
Date:
On 10/10/2016 05:25 PM, Michael Paquier wrote:
> On Fri, Oct 7, 2016 at 2:59 AM, Pavan Deolasee <pavan.deolasee@gmail.com> wrote:
>> I believe the fix is very simple. The FSM change during truncation is
>> critical and the buffer must be marked by MarkBufferDirty() i.e. those
>> changes must make it to disk. I think it's alright not to WAL log them
>> because XLOG_SMGR_TRUNCATE will redo() them if a crash occurs. But it must
>> not be lost across a checkpoint. Also, since it happens only during relation
>> truncation, I don't see any problem from performance perspective.
>
> Agreed. I happen to notice that the VM is similarly careful when it comes
> to truncation (visibilitymap_truncate).

visibilitymap_truncate is actually also wrong, in a different way. The
truncation WAL record is written only after the VM (and FSM) are
truncated. But visibilitymap_truncate() has already modified and dirtied
the page. If the VM page change is flushed to disk before the WAL
record, and you crash, you might have a torn VM page and a checksum failure.

Simply replacing the MarkBufferDirtyHint() call with MarkBufferDirty()
in FreeSpaceMapTruncateRel would have the same issue. If you call
MarkBufferDirty(), you must WAL-log the change, and also set the page's
LSN to make sure the WAL record is flushed first.

I think we need something like the attached.

- Heikki


Attachment

Re: FSM corruption leading to errors

From
Michael Paquier
Date:
On Mon, Oct 17, 2016 at 8:04 PM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> visibilitymap_truncate is actually also wrong, in a different way. The
> truncation WAL record is written only after the VM (and FSM) are truncated.
> But visibilitymap_truncate() has already modified and dirtied the page. If
> the VM page change is flushed to disk before the WAL record, and you crash,
> you might have a torn VM page and a checksum failure.

Good point! I didn't think about that.

> Simply replacing the MarkBufferDirtyHint() call with MarkBufferDirty() in
> FreeSpaceMapTruncateRel would have the same issue. If you call
> MarkBufferDirty(), you must WAL-log the change, and also set the page's LSN
> to make sure the WAL record is flushed first.

Right. All the code paths calling FreeSpaceMapTruncateRel or
visibilitymap_truncate do flush the record, but it definitely makes
things more robust to set the LSN on the page. So +1 for doing it this
way.

> I think we need something like the attached.

OK, I had a look at that.

        MarkBufferDirty(mapBuffer);
+       if (lsn != InvalidXLogRecPtr)
+           PageSetLSN(page, lsn);
        UnlockReleaseBuffer(mapBuffer);
Nit: using XLogRecPtrIsInvalid() makes more sense for such checks?

pg_visibility calls visibilitymap_truncate as well, but this patch did
not update it. And it would be better to do the actual truncation after
writing the WAL record, I think.

You are also breaking XLogFlush() in RelationTruncate(): the vm and
fsm flags are set via calls to smgrexists(), which happen before
XLogFlush() is called on HEAD but after it with your patch. So your
patch ends up never calling XLogFlush() on relation truncation even if
there is a VM or an FSM.

Attached is an updated version with those issues fixed. The final
version does not need the test, as that's really a corner case, but I
still included it to help with testing for now.
--
Michael

Attachment

Re: FSM corruption leading to errors

From
Pavan Deolasee
Date:


On Mon, Oct 17, 2016 at 4:34 PM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:


visibilitymap_truncate is actually also wrong, in a different way. The truncation WAL record is written only after the VM (and FSM) are truncated. But visibilitymap_truncate() has already modified and dirtied the page. If the VM page change is flushed to disk before the WAL record, and you crash, you might have a torn VM page and a checksum failure.


Right, I missed the problem with checksums.
 
Simply replacing the MarkBufferDirtyHint() call with MarkBufferDirty() in FreeSpaceMapTruncateRel would have the same issue. If you call MarkBufferDirty(), you must WAL-log the change, and also set the page's LSN to make sure the WAL record is flushed first.


I'm not sure I fully understand this. If the page is flushed before the WAL record, we might have a truncated FSM even though the relation truncation is not replayed. But that's not the same problem, right? I mean, you might have an FSM which is not accurate, but at least it won't return blocks outside the valid range. Having said that, I agree your proposed changes are more robust.

BTW, any thoughts on the race condition on the primary? The comments at MarkBufferDirtyHint() seem to suggest that a race condition is possible which might leave the buffer without the DIRTY flag, but I'm not sure whether that can only happen when the page is locked in shared mode. While most of the reports so far involved standbys, and the bug can also hit a standalone master during crash recovery, I wonder whether a race can occur even on a live system.

Thanks,
Pavan

--
 Pavan Deolasee                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: FSM corruption leading to errors

From
Heikki Linnakangas
Date:
On 10/18/2016 07:01 AM, Pavan Deolasee wrote:
> On Mon, Oct 17, 2016 at 4:34 PM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>
>> visibilitymap_truncate is actually also wrong, in a different way. The
>> truncation WAL record is written only after the VM (and FSM) are truncated.
>> But visibilitymap_truncate() has already modified and dirtied the page. If
>> the VM page change is flushed to disk before the WAL record, and you crash,
>> you might have a torn VM page and a checksum failure.
>
> Right, I missed the problem with checksums.
>
>> Simply replacing the MarkBufferDirtyHint() call with MarkBufferDirty() in
>> FreeSpaceMapTruncateRel would have the same issue. If you call
>> MarkBufferDirty(), you must WAL-log the change, and also set the page's LSN
>> to make sure the WAL record is flushed first.
>
> I'm not sure I fully understand this. If the page is flushed before the WAL
> record, we might have a truncated FSM where as the relation truncation is
> not replayed. But that's not the same problem, right? I mean, you might
> have an FSM which is not accurate, but it won't at the least return the
> blocks outside the range. Having said that, I agree your proposed changes
> are more robust.

Actually, this is still not 100% safe. Flushing the WAL before modifying
the FSM page is not enough. We also need to WAL-log a full-page image of
the FSM page, otherwise we are still vulnerable to the torn page problem.

I came up with the attached. This is fortunately much simpler than my
previous attempt. I replaced the MarkBufferDirtyHint() calls with
MarkBufferDirty(), to fix the original issue, plus WAL-logging a
full-page image to fix the torn page issue.
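
In code form, the approach is roughly the following in FreeSpaceMapTruncateRel() (a sketch of the idea rather than the attachment itself; the exact guard condition is my assumption):

    LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
    fsm_truncate_avail(BufferGetPage(buf), first_removed_slot);

    /*
     * Dirty the buffer for real (MarkBufferDirtyHint() would do nothing
     * during recovery), and WAL-log a full-page image so a torn FSM page
     * cannot survive a crash when checksums/wal_log_hints are in use.
     * Relation truncation itself is WAL-logged at a higher level, so
     * nothing extra is needed during recovery.
     */
    MarkBufferDirty(buf);
    if (!InRecovery && RelationNeedsWAL(rel) && XLogHintBitIsNeeded())
        log_newpage_buffer(buf, false);

    UnlockReleaseBuffer(buf);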

> BTW any thoughts on race-condition on the primary? Comments at
> MarkBufferDirtyHint() seems to suggest that a race condition is possible
> which might leave the buffer without the DIRTY flag, but I'm not sure if
> that can only happen when the page is locked in shared mode.

I think the race condition can only happen when the page is locked in
shared mode. In any case, with this proposed fix, we'll use
MarkBufferDirty() rather than MarkBufferDirtyHint(), so it's moot.

- Heikki


Attachment

Re: FSM corruption leading to errors

From
Pavan Deolasee
Date:


On Wed, Oct 19, 2016 at 2:37 PM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:


Actually, this is still not 100% safe. Flushing the WAL before modifying the FSM page is not enough. We also need to WAL-log a full-page image of the FSM page, otherwise we are still vulnerable to the torn page problem.

I came up with the attached. This is fortunately much simpler than my previous attempt. I replaced the MarkBufferDirtyHint() calls with MarkBufferDirty(), to fix the original issue, plus WAL-logging a full-page image to fix the torn page issue.


Looks good to me.
 
BTW any thoughts on race-condition on the primary? Comments at
MarkBufferDirtyHint() seems to suggest that a race condition is possible
which might leave the buffer without the DIRTY flag, but I'm not sure if
that can only happen when the page is locked in shared mode.

I think the race condition can only happen when the page is locked in shared mode. In any case, with this proposed fix, we'll use MarkBufferDirty() rather than MarkBufferDirtyHint(), so it's moot.


Yes, the fix will cover that problem (if it exists). The reason I was curious is that there are several reports of a similar error in the past, and some of them did not involve a standby. Those reports mostly remained unresolved, and I wondered if this explains them. But yeah, my conclusion was that the race is not possible with the page locked in EXCLUSIVE mode. So maybe there is another problem somewhere, or crash recovery may have left the FSM in an inconsistent state.

Anyways, we seem good to go with the patch.

Thanks,
Pavan
--
 Pavan Deolasee                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: FSM corruption leading to errors

From
Heikki Linnakangas
Date:
On 10/19/2016 01:07 PM, Pavan Deolasee wrote:
> Anyways, we seem good to go with the patch.

Ok, committed. Thanks for the analysis!

- Heikki




Re: FSM corruption leading to errors

From
Heikki Linnakangas
Date:
On 10/19/2016 02:29 PM, Heikki Linnakangas wrote:
> On 10/19/2016 01:07 PM, Pavan Deolasee wrote:
>> Anyways, we seem good to go with the patch.
>
> Ok, committed. Thanks for the analysis!

Oh, forgot that this needs to be backported, of course. Will do that 
shortly...

- Heikki




Re: FSM corruption leading to errors

From
Michael Paquier
Date:
On Wed, Oct 19, 2016 at 8:29 PM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> On 10/19/2016 01:07 PM, Pavan Deolasee wrote:
>>
>> Anyways, we seem good to go with the patch.
>
> Ok, committed. Thanks for the analysis!

Thanks! I am surprised that you kept the TAP test at the end.
-- 
Michael



Re: FSM corruption leading to errors

From
Heikki Linnakangas
Date:
On 10/19/2016 02:32 PM, Heikki Linnakangas wrote:
> On 10/19/2016 02:29 PM, Heikki Linnakangas wrote:
>> On 10/19/2016 01:07 PM, Pavan Deolasee wrote:
>>> Anyways, we seem good to go with the patch.
>>
>> Ok, committed. Thanks for the analysis!
>
> Oh, forgot that this needs to be backported, of course. Will do that
> shortly...

Done.

This didn't include anything to cope with an already-corrupt FSM, BTW. 
Do we still want to try something for that? I think it's good enough if 
we prevent the FSM corruption from happening, but not sure what the 
consensus on that might be..

- Heikki




Re: FSM corruption leading to errors

From
Tom Lane
Date:
Heikki Linnakangas <hlinnaka@iki.fi> writes:
> This didn't include anything to cope with an already-corrupt FSM, BTW. 
> Do we still want to try something for that? I think it's good enough if 
> we prevent the FSM corruption from happening, but not sure what the 
> consensus on that might be..

Can we document an existing procedure for repairing FSM corruption?
(VACUUM, maybe?)  We're going to need to be able to explain how to
fix busted visibility maps in 9.6.1, too.
        regards, tom lane



Re: FSM corruption leading to errors

From
Pavan Deolasee
Date:


On Wed, Oct 19, 2016 at 6:44 PM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
On 10/19/2016 02:32 PM, Heikki Linnakangas wrote:


Oh, forgot that this needs to be backported, of course. Will do that
shortly...

Done.

Thanks!
 

This didn't include anything to cope with an already-corrupt FSM, BTW. Do we still want to try something for that? I think it's good enough if we prevent the FSM corruption from happening, but not sure what the consensus on that might be..


I thought it would be nice to handle an already-corrupt FSM, since our customer found the problem immediately after a failover and it was a serious issue at that point. In one case a system table was affected, preventing all DDLs from running. Having said that, I don't have a better idea for handling the problem without causing non-trivial overhead for normal cases (see my original patch). If you have better ideas, they might be worth pursuing.

Thanks,
Pavan

--
 Pavan Deolasee                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: FSM corruption leading to errors

From
Pavan Deolasee
Date:

On Wed, Oct 19, 2016 at 6:54 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:


Can we document an existing procedure for repairing FSM corruption?
(VACUUM, maybe?) 

I'm afraid it may not be easy to repair the corruption with existing facilities. Most often the corruption will be on the standby, and a VACUUM may not actually touch the affected pages on the master (because they may not even exist on the master, or may be skipped because of the visibility map). It may not even trigger relation truncation. So AFAICS it may not generate any WAL activity that can fix the corruption on the standby.

One possible way would be to delete the FSM (and VM) information on the master and standby and then run VACUUM so these maps are rebuilt. We obviously don't need to do this for all tables, but we need a way to find the tables with a corrupt FSM [1].

Suggested procedure could be:

1. Upgrade the master and standby to the latest minor release (which involves a restart).
2. Install the pg_freespace extension and run the query [1] on the master to find possible corruption. The query checks whether the FSM reports free space in a block outside the size of the relation. Unfortunately, we might get false positives if the relation is extended while the query is running.
3. Repeat the same query on the standby (if it's running in hot standby mode; otherwise the corruption can only be detected once it's promoted to be a master).
4. Remove the FSM and VM files for the affected tables (I don't think it's safe to do this on a running server).
5. VACUUM the affected tables so that the FSM and VM are rebuilt.

Another idea is to implement a pg_freespace_repair() function in pg_freespace which takes an AccessExclusiveLock on the table and truncates the FSM to the table's current size, thus generating a WAL record that the standby will replay to fix the corruption. This looks more promising, easier to explain and less error prone.


[1]  SELECT *
  FROM (
        SELECT oid::regclass as relname, EXISTS (
                SELECT *
                  FROM (
                        SELECT blkno, pg_freespace(oid::regclass, blkno)
                          FROM generate_series(pg_relation_size(oid::regclass)/ current_setting('block_size')::bigint, pg_relation_size(oid::regclass, 'fsm') / 2) as blkno
                       ) as avail
                 WHERE pg_freespace > 0
               ) as corrupt_fsm
          FROM pg_class
         WHERE relkind = 'r'
       ) b
 WHERE b.corrupt_fsm = true;


Thanks,
Pavan

--
 Pavan Deolasee                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: FSM corruption leading to errors

From
Michael Paquier
Date:
On Thu, Oct 20, 2016 at 2:11 PM, Pavan Deolasee
<pavan.deolasee@gmail.com> wrote:
> 4. Remove FSM and VM files for the affected tables (I don't think if it's
> safe to do this on a running server)

Definitely not while the server is running... For the VM a good way would
be to use pg_visibility's pg_truncate_visibility_map(), but only for
9.6~. For the FSM there is no real solution, and a pg_truncate_fsm
would actually prove to be useful here. So users would need to stop
the server if there are corrupted VMs or FSMs, delete the files
manually, and then run VACUUM to rebuild them.
-- 
Michael



Re: FSM corruption leading to errors

From
Pavan Deolasee
Date:


On Thu, Oct 20, 2016 at 10:50 AM, Michael Paquier <michael.paquier@gmail.com> wrote:

 For VMs a good way would
be to use pg_visibility's pg_truncate_visibility_map(), but only for
9.6~.

Ah ok.. 
 
For FSM there is no real solution, and actually a
pg_truncate_fsm would prove to be useful here.

Right, that's what I proposed as an alternate idea. I agree this is a much cleaner and less error-prone approach.

Actually, if we could add an API which truncates the FSM to the given heap block, then the user may not even need to run VACUUM, which could be costly for very large tables. Also, AFAICS we will need to backport pg_truncate_visibility_map() to older releases, because unless the VM is truncated along with the FSM, VACUUM may not scan all pages and the free space for those pages won't be recorded in the FSM.

Thanks,
Pavan

--
 Pavan Deolasee                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: FSM corruption leading to errors

From
Michael Paquier
Date:
On Thu, Oct 20, 2016 at 2:50 PM, Pavan Deolasee
<pavan.deolasee@gmail.com> wrote:
> Actually, if we could add an API which can truncate FSM to the given heap
> block, then the user may not even need to run VACUUM, which could be costly
> for very large tables.

FreeSpaceMapTruncateRel()?

> Also, AFAICS we will need to backport
> pg_truncate_visibility_map() to older releases because unless the VM is
> truncated along with the FSM, VACUUM may not scan all pages and the FSM for
> those pages won't be recorded.

This would need a careful lookup, but it would not be that complicated
to implement. And this bug actually justifies the presence of a tool
like that... But usually those don't get backpatched, so the
probability of getting it backported is low IMO; personally I am not
sure I like that either.
-- 
Michael



Re: FSM corruption leading to errors

From
Pavan Deolasee
Date:


On Thu, Oct 20, 2016 at 11:34 AM, Michael Paquier <michael.paquier@gmail.com> wrote:
On Thu, Oct 20, 2016 at 2:50 PM, Pavan Deolasee
<pavan.deolasee@gmail.com> wrote:
> Actually, if we could add an API which can truncate FSM to the given heap
> block, then the user may not even need to run VACUUM, which could be costly
> for very large tables.

FreeSpaceMapTruncateRel()?

Right. We need a SQL callable function to invoke that.
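
Something along these lines, perhaps (a hypothetical sketch; the function name and the absence of any permission or relkind checks are mine):

    #include "postgres.h"
    #include "access/heapam.h"
    #include "fmgr.h"
    #include "storage/bufmgr.h"
    #include "storage/freespace.h"
    #include "storage/lmgr.h"

    PG_FUNCTION_INFO_V1(pg_truncate_fsm);   /* hypothetical name */

    Datum
    pg_truncate_fsm(PG_FUNCTION_ARGS)
    {
        Oid         relid = PG_GETARG_OID(0);
        Relation    rel;

        /* Block concurrent extension while the FSM is cut back to size */
        rel = heap_open(relid, AccessExclusiveLock);

        /* Drop all FSM entries at or beyond the current end of the heap */
        FreeSpaceMapTruncateRel(rel, RelationGetNumberOfBlocks(rel));

        heap_close(rel, AccessExclusiveLock);
        PG_RETURN_VOID();
    }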
 

> Also, AFAICS we will need to backport
> pg_truncate_visibility_map() to older releases because unless the VM is
> truncated along with the FSM, VACUUM may not scan all pages and the FSM for
> those pages won't be recorded.

This would need a careful lookup, and it would not be that complicated
to implement. And this bug justifies the presence of a tool like that
actually... But usually those don't get a backpatch, so the
probability of getting that backported is low IMO, personally I am not
sure I like that either.

Just to clarify, I meant that if we truncate the entire FSM then we'll also need an API to truncate the VM, so that VACUUM rebuilds everything completely. OTOH if we provide a function that just truncates the FSM to match the size of the table, then we don't need to rebuild the FSM. So that's probably a better way to handle FSM corruption, as far as this particular issue is concerned.

Thanks,
Pavan

--
 Pavan Deolasee                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: FSM corruption leading to errors

From
Michael Paquier
Date:
On Thu, Oct 20, 2016 at 3:37 PM, Pavan Deolasee
<pavan.deolasee@gmail.com> wrote:
> Just to clarify, I meant if we truncate the entire FSM then we'll need API
> to truncate VM as well so that VACUUM rebuilds everything completely. OTOH
> if we provide a function just to truncate FSM to match the size of the
> table, then we don't need to rebuild the FSM. So that's probably a better
> way to handle FSM corruption, as far as this particular issue is concerned.

To be honest, I think that just having in the release notes a method
that does not involve the use of any extra extension or SQL routine is
fine. So we could just tell users to:
1) Run something like the query you gave upthread, giving the user
a list of the files that are corrupted. And add this query to the
release notes.
2) If anything is found, stop the server and delete the files manually.
3) Re-start the server.
OK, that's troublesome and costly for large relations, but we know
that's the safest way to go for any version, and there is no need to
complicate the code with any one-time repairing extensions.

Speaking of which, I implemented a small extension able to truncate
the FSM up to the size of the relation, as attached, but as I looked at
it, SMGR_TRUNCATE_FSM was only introduced in 9.6, so its range of action
is rather limited... I pushed a version to GitHub as well:
https://github.com/michaelpq/pg_plugins/tree/master/pg_fix_truncation
The limited range of such an extension is a good enough argument to
just rely on the stop/delete-FSM/start method to fix an instance and
let VACUUM do the rest of the work. It looks to work, but use it at
your own risk.

This bug would be a good blog topic by the way...
--
Michael

Attachment

Re: FSM corruption leading to errors

From
Jim Nasby
Date:
On 10/20/16 10:15 PM, Michael Paquier wrote:
> 2) If anything is found, stop the server and delete the files manually.
> 3) Re-start the server.
> OK, that's troublesome and costly for large relations, but we know
> that's the safest way to go for any versions, and there is no need to
> complicate the code with any one-time repairing extensions.

Wouldn't you need to run around to all your replicas and do that as well?

> Speaking of which, I implemented a small extension able to truncate
> the FSM up to the size of the relation as attached, but as I looked at
> it SMGR_TRUNCATE_FSM has been introduced in 9.6 so its range of action
> is rather limited... And I pushed as well a version on github:
> https://github.com/michaelpq/pg_plugins/tree/master/pg_fix_truncation
> The limitation range of such an extension is a argument good enough to
> just rely on the stop/delete-FSM/start method to fix an instance and
> let VACUUM do the rest of the work. That looks to work but use it at
> your own risk.

Shouldn't the truncation be logged before it's performed?
-- 
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com
855-TREBLE2 (855-873-2532)   mobile: 512-569-9461



Re: FSM corruption leading to errors

From
Michael Paquier
Date:
On Sat, Oct 22, 2016 at 5:17 AM, Jim Nasby <Jim.Nasby@bluetreble.com> wrote:
> On 10/20/16 10:15 PM, Michael Paquier wrote:
>>
>> 2) If anything is found, stop the server and delete the files manually.
>> 3) Re-start the server.
>> OK, that's troublesome and costly for large relations, but we know
>> that's the safest way to go for any versions, and there is no need to
>> complicate the code with any one-time repairing extensions.
>
> Wouldn't you need to run around to all your replicas and do that as well?

Yeah...

>> Speaking of which, I implemented a small extension able to truncate
>> the FSM up to the size of the relation as attached, but as I looked at
>> it SMGR_TRUNCATE_FSM has been introduced in 9.6 so its range of action
>> is rather limited... And I pushed as well a version on github:
>> https://github.com/michaelpq/pg_plugins/tree/master/pg_fix_truncation
>> The limitation range of such an extension is a argument good enough to
>> just rely on the stop/delete-FSM/start method to fix an instance and
>> let VACUUM do the rest of the work. That looks to work but use it at
>> your own risk.
>
>
> Shouldn't the truncation be logged before it's performed?

Doh. Yes, thanks for the reminder. And I commented on that upthread..
-- 
Michael



Re: FSM corruption leading to errors

From
Michael Paquier
Date:
On Sat, Oct 22, 2016 at 7:31 AM, Michael Paquier
<michael.paquier@gmail.com> wrote:
> On Sat, Oct 22, 2016 at 5:17 AM, Jim Nasby <Jim.Nasby@bluetreble.com> wrote:
>> On 10/20/16 10:15 PM, Michael Paquier wrote:
>>>
>>> 2) If anything is found, stop the server and delete the files manually.
>>> 3) Re-start the server.
>>> OK, that's troublesome and costly for large relations, but we know
>>> that's the safest way to go for any versions, and there is no need to
>>> complicate the code with any one-time repairing extensions.
>>
>> Wouldn't you need to run around to all your replicas and do that as well?
>
> Yeah...

The release notes refer to this page for methods to fix corrupted instances:
https://wiki.postgresql.org/wiki/Free_Space_Map_Problems
I just typed something up based on Pavan's method upthread; feel free to
jump in and improve it as necessary.
-- 
Michael



Re: FSM corruption leading to errors

From
Tom Lane
Date:
I wrote:
> It looks to me like this is approximating the highest block number that
> could possibly have an FSM entry as size of the FSM fork (in bytes)
> divided by 2.  But the FSM stores one byte per block.  There is overhead
> for the FSM search tree, but in a large relation it's not going to be as
> much as a factor of 2.  So I think that to be conservative we need to
> drop the "/ 2".  Am I missing something?

Ah, scratch that, after rereading the FSM README I see it's correct,
because there's a binary tree within each page; I'd only remembered
that there was a search tree of pages.

Also, we could at least discount the FSM root page and first intermediate
page, no?  That is, the upper limit could be
pg_relation_size(oid::regclass, 'fsm') / 2 - 2*current_setting('block_size')::BIGINT

I think this is a worthwhile improvement because it reduces the time spent
on small relations.  For me, the query as given takes 9 seconds to examine
the regression database, which seems like a lot.  Discounting two pages
reduces that to 20 ms.
        regards, tom lane



Re: FSM corruption leading to errors

From
Pavan Deolasee
Date:


On Mon, Oct 24, 2016 at 9:34 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

    SELECT blkno, pg_freespace(oid::regclass, blkno)
    FROM generate_series(pg_relation_size(oid::regclass) / current_setting('block_size')::BIGINT,
                         pg_relation_size(oid::regclass, 'fsm') / 2) AS blkno

It looks to me like this is approximating the highest block number that
could possibly have an FSM entry as size of the FSM fork (in bytes)
divided by 2.  But the FSM stores one byte per block.  There is overhead
for the FSM search tree, but in a large relation it's not going to be as
much as a factor of 2.  So I think that to be conservative we need to
drop the "/ 2".  Am I missing something?


I went by these comments in fsm_internals.h, which suggest that SlotsPerFSMPage is limited to somewhat less than BLCKSZ divided by 2.

/*
 * Number of non-leaf and leaf nodes, and nodes in total, on an FSM page.
 * These definitions are internal to fsmpage.c.
 */ 
#define NodesPerPage (BLCKSZ - MAXALIGN(SizeOfPageHeaderData) - \
                      offsetof(FSMPageData, fp_nodes))
    
#define NonLeafNodesPerPage (BLCKSZ / 2 - 1)
#define LeafNodesPerPage (NodesPerPage - NonLeafNodesPerPage)

/*
 * Number of FSM "slots" on a FSM page. This is what should be used
 * outside fsmpage.c.
 */
#define SlotsPerFSMPage LeafNodesPerPage 
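
For BLCKSZ = 8192 this works out to a little over 4000 slots per FSM page. A quick standalone back-of-the-envelope check (the two header constants are hard-coded assumptions for a typical 64-bit build):

    #include <stdio.h>

    int
    main(void)
    {
        const int blcksz = 8192;
        const int page_header = 24;    /* MAXALIGN(SizeOfPageHeaderData), assumed */
        const int fsm_header = 4;      /* offsetof(FSMPageData, fp_nodes), assumed */

        int nodes_per_page = blcksz - page_header - fsm_header;   /* 8164 */
        int nonleaf_nodes = blcksz / 2 - 1;                        /* 4095 */
        int leaf_nodes = nodes_per_page - nonleaf_nodes;           /* 4069 */

        /* One FSM page of 8192 bytes covers ~4069 heap blocks, i.e. roughly
         * two bytes of FSM per heap block, so dividing the FSM fork size by
         * two gives a safe (slightly conservative) upper bound. */
        printf("slots per FSM page: %d (BLCKSZ/2 = %d)\n",
               leaf_nodes, blcksz / 2);
        return 0;
    }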


Thanks,
Pavan

--
 Pavan Deolasee                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: FSM corruption leading to errors

From
Tom Lane
Date:
Michael Paquier <michael.paquier@gmail.com> writes:
> Release notes refer to this page for methods to fix corrupted instances:
> https://wiki.postgresql.org/wiki/Free_Space_Map_Problems
> I just typed something based on Pavan's upper method, feel free to
> jump in and improve it as necessary.

Thanks for drafting that, but isn't the query wrong?
Specifically the bit about
    SELECT blkno, pg_freespace(oid::regclass, blkno)
      FROM generate_series(pg_relation_size(oid::regclass) / current_setting('block_size')::BIGINT,
                           pg_relation_size(oid::regclass, 'fsm') / 2) AS blkno

It looks to me like this is approximating the highest block number that
could possibly have an FSM entry as size of the FSM fork (in bytes)
divided by 2.  But the FSM stores one byte per block.  There is overhead
for the FSM search tree, but in a large relation it's not going to be as
much as a factor of 2.  So I think that to be conservative we need to
drop the "/ 2".  Am I missing something?
        regards, tom lane



Re: FSM corruption leading to errors

From
Pavan Deolasee
Date:


On Mon, Oct 24, 2016 at 9:47 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:


Also, we could at least discount the FSM root page and first intermediate
page, no?  That is, the upper limit could be

        pg_relation_size(oid::regclass, 'fsm') / 2 - 2*current_setting('block_size')::BIGINT


+1 for doing that.

Thanks,
Pavan

--
 Pavan Deolasee                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: FSM corruption leading to errors

From
Alvaro Herrera
Date:
Tom Lane wrote:

> Ah, scratch that, after rereading the FSM README I see it's correct,
> because there's a binary tree within each page; I'd only remembered
> that there was a search tree of pages.
> 
> Also, we could at least discount the FSM root page and first intermediate
> page, no?  That is, the upper limit could be
> 
>     pg_relation_size(oid::regclass, 'fsm') / 2 - 2*current_setting('block_size')::BIGINT
> 
> I think this is a worthwhile improvement because it reduces the time spent
> on small relations.  For me, the query as given takes 9 seconds to examine
> the regression database, which seems like a lot.  Discounting two pages
> reduces that to 20 ms.

Hah, good one.  We spent some time thinking about subtracting some value
to make the estimate more accurate, but it didn't occur to me to just use
a constant two.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: FSM corruption leading to errors

From
Tom Lane
Date:
Alvaro Herrera <alvherre@2ndquadrant.com> writes:
> Tom Lane wrote:
>> Also, we could at least discount the FSM root page and first intermediate
>> page, no?  That is, the upper limit could be
>>
>> pg_relation_size(oid::regclass, 'fsm') / 2 - 2*current_setting('block_size')::BIGINT
>>
>> I think this is a worthwhile improvement because it reduces the time spent
>> on small relations.  For me, the query as given takes 9 seconds to examine
>> the regression database, which seems like a lot.  Discounting two pages
>> reduces that to 20 ms.

> Hah, good one.  We spent some time thinking about subtracting some value
> to make the value more accurate but it didn't occur to me to just use
> constant two.

I got the arithmetic wrong in the above, it should be like

(pg_relation_size(oid::regclass, 'fsm') - 2*current_setting('block_size')::BIGINT) / 2

With that, the runtime on HEAD's regression DB is about 700 ms, which is
still a nice win over 9000 ms.

I've put up draft wiki pages about these two problems at

https://wiki.postgresql.org/wiki/Free_Space_Map_Problems
https://wiki.postgresql.org/wiki/Visibility_Map_Problems

(thanks to Michael Paquier for initial work on the first one).
They're meant to be reasonably generic about FSM/VM problems
rather than only being about our current bugs.  Please review.
        regards, tom lane