Thread: FSM corruption leading to errors
I investigated a bug report from one of our customers and it looked very similar to previous bug reports here [1], [2], [3] (and probably more). In these reports, the error looks something like this:
ERROR: could not read block 28991 in file "base/16390/572026": read only 0 of 8192 bytes
I traced it to the following code in MarkBufferDirtyHint(). The function returns without setting the DIRTY bit on the standby:
	/*
	 * If we're in recovery we cannot dirty a page because of a hint.
	 * We can set the hint, just not dirty the page as a result so the
	 * hint is lost when we evict the page or shutdown.
	 *
	 * See src/backend/storage/page/README for longer discussion.
	 */
	if (RecoveryInProgress())
		return;
freespace.c freely uses MarkBufferDirtyHint() whenever changes are made to the FSM. I think that's usually all right, because FSM changes are not WAL-logged, and if the FSM ever returns a block with less free space than the caller needs, the caller is usually prepared to update the FSM and request a new block. But if it returns a block that is outside the size of the relation, then we're in trouble: the very next ReadBuffer() fails to handle such a block and throws the error above.
When a relation is truncated, the FSM is truncated too, to remove references to the heap blocks that are being truncated away. But since the FSM buffer may not be marked dirty on the standby, if the buffer gets evicted from the buffer cache, the on-disk copy of the FSM page may be left with references to the truncated heap pages. When the standby is later promoted to be the master and an insert/update is attempted on the table, the FSM may return a block that is outside the valid range of the relation. That results in the said error.
Once this was clear, it was easy to put together a fully reproducible test case. See the attached script; you'll need to adjust it to your environment. This affects all releases starting with 9.3, and the script can reproduce the problem on all of them.
I believe the fix is very simple: the FSM change during truncation is critical, and the buffer must be marked with MarkBufferDirty(), i.e. those changes must make it to disk. I think it's all right not to WAL-log them, because XLOG_SMGR_TRUNCATE will redo them if a crash occurs, but they must not be lost across a checkpoint. Also, since this happens only during relation truncation, I don't see any problem from a performance perspective.
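For illustration, a minimal sketch of that change inside FreeSpaceMapTruncateRel() (simplified and abbreviated; the messages below refine this further):

    /*
     * Sketch only: simplified excerpt of FreeSpaceMapTruncateRel().  Wipe
     * the FSM entries for the truncated-away heap blocks, then dirty the
     * buffer for real so the change survives eviction, checkpoints and
     * recovery on a standby.
     */
    buf = fsm_readbuf(rel, first_removed_address, false);
    if (BufferIsValid(buf))
    {
        LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
        fsm_truncate_avail(BufferGetPage(buf), first_removed_slot);

        /* was: MarkBufferDirtyHint(buf, false), which is a no-op during recovery */
        MarkBufferDirty(buf);

        UnlockReleaseBuffer(buf);
    }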
What bothers me is how to fix the problem for already-affected standbys. If the FSM for some table is already corrupted on the standby, users won't notice it until the standby is promoted to be the new master. If the standby suddenly starts throwing errors after failover, that is a very bad situation for the users, as we saw with our customers. The fix is simple, and users can just delete the FSM (and VACUUM the table), but that doesn't sound nice, and they would not know to do it until they see the problem.
One idea is to always check whether the block returned by the FSM is outside the range, and discard such blocks after updating the FSM (the attached patch does that). The problem with that approach is that RelationGetNumberOfBlocks() is not really cheap, and invoking it every time the FSM is consulted may not be a bright idea. Can we cache that value in the RelationData or some such place (BulkInsertState?) and use it as a hint for quickly checking whether a block is (potentially) outside the range, and discard it? Any other ideas?
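As a rough illustration of that caching idea (a hypothetical helper, not an existing PostgreSQL API; in a real patch the cached value would live in RelationData or BulkInsertState rather than a static variable):

    #include "postgres.h"
    #include "storage/block.h"
    #include "storage/bufmgr.h"
    #include "utils/rel.h"

    /*
     * Hypothetical sketch: accept a block returned by the FSM only if it lies
     * below a cached copy of the relation size, and call the comparatively
     * expensive RelationGetNumberOfBlocks() only when the candidate block is
     * at or beyond the last size we saw.
     */
    static BlockNumber cached_nblocks = InvalidBlockNumber;

    static bool
    fsm_block_plausible(Relation rel, BlockNumber target_block)
    {
        if (target_block == InvalidBlockNumber)
            return false;

        /* Fast path: block is below the relation size we last observed. */
        if (cached_nblocks != InvalidBlockNumber && target_block < cached_nblocks)
            return true;

        /* Slow path: refresh the cached size and re-check. */
        cached_nblocks = RelationGetNumberOfBlocks(rel);
        return target_block < cached_nblocks;
    }

A caller would discard (and reset the FSM entry for) any block for which this returns false, falling back to extending the relation.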
The other concern I have (and, TBH, that's what I initially thought was the real problem, until I saw the RecoveryInProgress()-specific code) is: can this also affect stand-alone masters? The comments at MarkBufferDirtyHint() made me think so:
 * 3. This function does not guarantee that the buffer is always marked dirty
 *    (due to a race condition), so it cannot be used for important changes.
So I was working with a theory that updates to the FSM page are somehow lost because the race mentioned in the comment actually kicks in. But I'm not sure whether the race is only possible when the caller is holding a SHARE lock on the buffer. When the FSM is truncated, the caller holds an EXCLUSIVE lock on the FSM buffer, so we're probably safe. I could not reproduce the issue on a stand-alone master, but it's probably worth checking.
It might also be a good idea to inspect other callers of MarkBufferDirtyHint() and see if any of them are vulnerable, especially from the standby perspective. I did one round and couldn't see another problem.
Thanks,
Pavan
Pavan Deolasee http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Attachment
Could you please add the patches to commitfest?
I'm going to test them and write a review in a few days.
--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
On Fri, Oct 7, 2016 at 2:59 AM, Pavan Deolasee <pavan.deolasee@gmail.com> wrote:
> I investigated a bug report from one of our customers and it looked very
> similar to previous bug reports here [1], [2], [3] (and probably more). In
> these reports, the error looks something like this:
>
> ERROR: could not read block 28991 in file "base/16390/572026": read only 0
> of 8192 bytes
>
> I traced it to the following code in MarkBufferDirtyHint().
>
> freespace.c freely uses MarkBufferDirtyHint() whenever changes are made to
> the FSM. I think that's usually alright because FSM changes are not WAL
> logged and if FSM ever returns a block with less free space than the caller
> needs, the caller is usually prepared to update the FSM and request for a
> new block. But if it returns a block that is outside the size of the
> relation, then we've a trouble. The very next ReadBuffer() fails to handle
> such a block and throws the error.

To be honest, I have been seeing a very similar issue for a couple of
weeks now on some test beds, on a couple of relations involving a promoted
standby, this error happening more frequently for relations having a more
aggressive autovacuum setup. I did not take the time to look at that
seriously, and I am very glad to see this email.

> When a relation is truncated, the FSM is truncated too to remove references
> to the heap blocks that are being truncated. But since the FSM buffer may
> not be marked DIRTY on the standby, if the buffer gets evicted from the
> buffer cache, the on-disk copy of the FSM page may be left with references
> to the truncated heap pages. When the standby is later promoted to be the
> master, and an insert/update is attempted to the table, the FSM may return a
> block that is outside the valid range of the relation. That results in the
> said error.

Oops.

> I believe the fix is very simple. The FSM change during truncation is
> critical and the buffer must be marked by MarkBufferDirty() i.e. those
> changes must make to the disk. I think it's alright not to WAL log them
> because XLOG_SMGR_TRUNCATE will redo() them if a crash occurs. But it must
> not be lost across a checkpoint. Also, since it happens only during relation
> truncation, I don't see any problem from performance perspective.

Agreed. I happen to notice that VM is similarly careful when it comes
to truncate it (visibilitymap_truncate).

> What bothers me is how to fix the problem for already affected standbys. If
> the FSM for some table is already corrupted at the standby, users won't
> notice it until the standby is promoted to be the new master. If the standby
> starts throwing errors suddenly after failover, it will be a very bad
> situation for the users, like we noticed with our customers. The fix is
> simple and users can just delete the FSM (and VACUUM the table), but that
> doesn't sound nice and they would not know until they see the problem.

+   /*
+    * See comments in GetPageWithFreeSpace about handling outside the valid
+    * range blocks
+    */
+   nblocks = RelationGetNumberOfBlocks(rel);
+   while (target_block >= nblocks && target_block != InvalidBlockNumber)
+   {
+       target_block = RecordAndGetPageWithFreeSpace(rel, target_block, 0,
+                                                    spaceNeeded);
+   }

Hm. This is just a workaround. Even if things are done this way the
FSM will remain corrupted. And isn't that going to break once the
relation is extended again?

I'd suggest instead putting in the release notes a query that allows one
to analyze what are the relations broken and directly have them fixed.
That's annoying, but it would be really better than a workaround. One idea
here is to use pg_freespace() and see if it returns a non-zero value for
an out-of-range block on a standby.

I have also been thinking about the comment you added, and we could just
do something like that:

+   /*
+    * Mark the buffer still holding references to truncated blocks as
+    * dirty to be sure that this makes it to disk and is kept consistent
+    * with the truncated relation.
+    */
+   MarkBufferDirty(buf)

> It might also be a good idea to inspect other callers of
> MarkBufferDirtyHint() and see if any of them is vulnerable, especially from
> standby perspective. I did one round, and couldn't see another problem.

Haven't looked at those yet, will do so tomorrow.

At the same time, I have translated your script into a TAP test, I
found that more useful when testing..
--
Michael
On Mon, Oct 10, 2016 at 11:25 PM, Michael Paquier <michael.paquier@gmail.com> wrote:
> At the same time, I have translated your script into a TAP test, I
> found that more useful when testing..

Well... Here is the actual patch.
--
Michael
Attachment
On Fri, Oct 7, 2016 at 11:50 PM, Anastasia Lubennikova <a.lubennikova@postgrespro.ru> wrote:
> Could you please add the patches to commitfest?
> I'm going to test them and write a review in a few days.

Here you go: https://commitfest.postgresql.org/11/817/
--
Michael
On Mon, Oct 10, 2016 at 7:55 PM, Michael Paquier <michael.paquier@gmail.com> wrote:
+ /*
+ * See comments in GetPageWithFreeSpace about handling outside the valid
+ * range blocks
+ */
+ nblocks = RelationGetNumberOfBlocks(rel);
+ while (target_block >= nblocks && target_block != InvalidBlockNumber)
+ {
+ target_block = RecordAndGetPageWithFreeSpace(rel, target_block, 0,
+ spaceNeeded);
+ }
Hm. This is just a workaround. Even if things are done this way the
FSM will remain corrupted.
No, because the code above updates the FSM for those out-of-range blocks. But now that I look at it again, maybe this is not correct, and it may get into an endless loop if the relation is repeatedly extended concurrently.
And isn't that going to break once the
relation is extended again?
Once the underlying bug is fixed, I don't see why it should break again. I added the above code mostly to deal with already-corrupt FSMs. Maybe we can just document this and leave it to the user to run some correctness checks (see below), especially given that the code is not cheap and adds overhead for everybody, irrespective of whether they have, or will ever have, a corrupt FSM.
I'd suggest instead putting in the release
notes a query that allows one to analyze what are the relations broken
and directly have them fixed. That's annoying, but it would be really
better than a workaround. One idea here is to use pg_freespace() and
see if it returns a non-zero value for an out-of-range block on a
standby.
Right, that's how I tested for broken FSMs. A challenge with any such query is that if the shared-buffer copy of the FSM page is intact, then the query won't return the problematic FSMs. Of course, if the fix is applied to the standby and it is restarted, then corrupt FSMs can be detected.
At the same time, I have translated your script into a TAP test, I
found that more useful when testing..
Thanks for doing that.
Thanks,
Pavan
--
Pavan Deolasee http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Mon, Oct 10, 2016 at 11:41 PM, Pavan Deolasee <pavan.deolasee@gmail.com> wrote:
> On Mon, Oct 10, 2016 at 7:55 PM, Michael Paquier <michael.paquier@gmail.com> wrote:
>>
>> + /*
>> + * See comments in GetPageWithFreeSpace about handling outside the valid
>> + * range blocks
>> + */
>> + nblocks = RelationGetNumberOfBlocks(rel);
>> + while (target_block >= nblocks && target_block != InvalidBlockNumber)
>> + {
>> + target_block = RecordAndGetPageWithFreeSpace(rel, target_block, 0,
>> + spaceNeeded);
>> + }
>> Hm. This is just a workaround. Even if things are done this way the
>> FSM will remain corrupted.
>
> No, because the code above updates the FSM of those out-of-the range blocks.
> But now that I look at it again, may be this is not correct and it may get
> into an endless loop if the relation is repeatedly extended concurrently.

Ah yes, that's what the call for
RecordAndGetPageWithFreeSpace()/fsm_set_and_search() is for. I missed
that yesterday before sleeping.

>> And isn't that going to break once the
>> relation is extended again?
>
> Once the underlying bug is fixed, I don't see why it should break again. I
> added the above code to mostly deal with already corrupt FSMs. May be we can
> just document and leave it to the user to run some correctness checks (see
> below), especially given that the code is not cheap and adds overheads for
> everybody, irrespective of whether they have or will ever have corrupt FSM.

Yep. I'd leave it for the release notes to hold a diagnostic method.
That's annoying, but this has been done in the past like for the
multixact issues..

>> I'd suggest instead putting in the release
>> notes a query that allows one to analyze what are the relations broken
>> and directly have them fixed. That's annoying, but it would be really
>> better than a workaround. One idea here is to use pg_freespace() and
>> see if it returns a non-zero value for an out-of-range block on a
>> standby.
>
> Right, that's how I tested for broken FSMs. A challenge with any such query
> is that if the shared buffer copy of the FSM page is intact, then the query
> won't return problematic FSMs. Of course, if the fix is applied to the
> standby and is restarted, then corrupt FSMs can be detected.

What if you restart the standby, and then do a diagnostic query?
Wouldn't that be enough? (Something just based on
pg_freespace(pg_relation_size(oid) / block_size) != 0)
--
Michael
On Tue, Oct 11, 2016 at 5:20 AM, Michael Paquier <michael.paquier@gmail.com> wrote:
>
> Once the underlying bug is fixed, I don't see why it should break again. I
> added the above code to mostly deal with already corrupt FSMs. May be we can
> just document and leave it to the user to run some correctness checks (see
> below), especially given that the code is not cheap and adds overheads for
> everybody, irrespective of whether they have or will ever have corrupt FSM.
Yep. I'd leave it for the release notes to hold a diagnostic method.
That's annoying, but this has been done in the past like for the
multixact issues..
I'm okay with that. It's annoying, especially because the bug may show up when your primary is down and you have just failed over for HA, only to find that the standby won't work correctly. But I don't have any ideas for fixing existing corruption without adding a significant penalty to the normal path.
What if you restart the standby, and then do a diagnostic query?
Wouldn't that be enough? (Something just based on
pg_freespace(pg_relation_size(oid) / block_size) != 0)
I think this is a major bug and I would appreciate any ideas to get the patch in a committable shape before the next minor release goes out. We probably need a committer to get interested in this to make progress.
Thanks,
Pavan
Pavan Deolasee http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Mon, Oct 17, 2016 at 4:14 PM, Pavan Deolasee <pavan.deolasee@gmail.com> wrote:
> I think this is a major bug and I would appreciate any ideas to get the
> patch in a committable shape before the next minor release goes out. We
> probably need a committer to get interested in this to make progress.

Same opinion here, this is very annoying for some of my internal
users.. I don't have more to offer though.
--
Michael
On 10/10/2016 05:25 PM, Michael Paquier wrote:
> On Fri, Oct 7, 2016 at 2:59 AM, Pavan Deolasee <pavan.deolasee@gmail.com> wrote:
>> I believe the fix is very simple. The FSM change during truncation is
>> critical and the buffer must be marked by MarkBufferDirty() i.e. those
>> changes must make to the disk. I think it's alright not to WAL log them
>> because XLOG_SMGR_TRUNCATE will redo() them if a crash occurs. But it must
>> not be lost across a checkpoint. Also, since it happens only during relation
>> truncation, I don't see any problem from performance perspective.
>
> Agreed. I happen to notice that VM is similarly careful when it comes
> to truncate it (visibilitymap_truncate).

visibilitymap_truncate is actually also wrong, in a different way. The
truncation WAL record is written only after the VM (and FSM) are truncated.
But visibilitymap_truncate() has already modified and dirtied the page. If
the VM page change is flushed to disk before the WAL record, and you crash,
you might have a torn VM page and a checksum failure.

Simply replacing the MarkBufferDirtyHint() call with MarkBufferDirty() in
FreeSpaceMapTruncateRel would have the same issue. If you call
MarkBufferDirty(), you must WAL-log the change, and also set the page's LSN
to make sure the WAL record is flushed first.

I think we need something like the attached.

- Heikki
Attachment
On Mon, Oct 17, 2016 at 8:04 PM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> visibilitymap_truncate is actually also wrong, in a different way. The
> truncation WAL record is written only after the VM (and FSM) are truncated.
> But visibilitymap_truncate() has already modified and dirtied the page. If
> the VM page change is flushed to disk before the WAL record, and you crash,
> you might have a torn VM page and a checksum failure.

Good point! I didn't think about that.

> Simply replacing the MarkBufferDirtyHint() call with MarkBufferDirty() in
> FreeSpaceMapTruncateRel would have the same issue. If you call
> MarkBufferDirty(), you must WAL-log the change, and also set the page's LSN
> to make sure the WAL record is flushed first.

Right.. All the code paths calling FreeSpaceMapTruncateRel or
visibilitymap_truncate do flush the record, but it definitely makes things
more robust to set the LSN on the page. So +1 this way.

> I think we need something like the attached.

OK, I had a look at that.

     MarkBufferDirty(mapBuffer);
+    if (lsn != InvalidXLogRecPtr)
+        PageSetLSN(page, lsn);
     UnlockReleaseBuffer(mapBuffer);

Nit: using XLogRecPtrIsInvalid() makes more sense for such checks?

pg_visibility calls visibilitymap_truncate as well, but this patch did not
update it. And it would be better to do the actual truncate after writing
the WAL record, I think.

You are also breaking XLogFlush() in RelationTruncate(), because vm and fsm
are assigned thanks to a call of smgrexists(), something done before
XLogFlush is called on HEAD, and after it with your patch. So your patch
finishes by never calling XLogFlush() on relation truncation even if there
is a VM or a FSM.

Attached is an updated version with those issues fixed. The final version
does not need the test as that's really a corner case, but I still included
it to help testing for now.
--
Michael
Attachment
On Mon, Oct 17, 2016 at 4:34 PM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
visibilitymap_truncate is actually also wrong, in a different way. The truncation WAL record is written only after the VM (and FSM) are truncated. But visibilitymap_truncate() has already modified and dirtied the page. If the VM page change is flushed to disk before the WAL record, and you crash, you might have a torn VM page and a checksum failure.
Right, I missed the problem with checksums.
Simply replacing the MarkBufferDirtyHint() call with MarkBufferDirty() in FreeSpaceMapTruncateRel would have the same issue. If you call MarkBufferDirty(), you must WAL-log the change, and also set the page's LSN to make sure the WAL record is flushed first.
I'm not sure I fully understand this. If the page is flushed before the WAL record, we might have a truncated FSM whereas the relation truncation is not replayed. But that's not the same problem, right? I mean, you might have an FSM which is not accurate, but it won't at the least return the blocks outside the range. Having said that, I agree your proposed changes are more robust.
BTW any thoughts on race-condition on the primary? Comments at MarkBufferDirtyHint() seems to suggest that a race condition is possible which might leave the buffer without the DIRTY flag, but I'm not sure if that can only happen when the page is locked in shared mode. While most of the reports so far involved standbys, and the bug can also hit a standalone master during crash recovery, I wonder if a race can occur even on a live system.
Thanks,
Pavan
Pavan Deolasee http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 10/18/2016 07:01 AM, Pavan Deolasee wrote:
> On Mon, Oct 17, 2016 at 4:34 PM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>
>> visibilitymap_truncate is actually also wrong, in a different way. The
>> truncation WAL record is written only after the VM (and FSM) are truncated.
>> But visibilitymap_truncate() has already modified and dirtied the page. If
>> the VM page change is flushed to disk before the WAL record, and you crash,
>> you might have a torn VM page and a checksum failure.
>
> Right, I missed the problem with checksums.
>
>> Simply replacing the MarkBufferDirtyHint() call with MarkBufferDirty() in
>> FreeSpaceMapTruncateRel would have the same issue. If you call
>> MarkBufferDirty(), you must WAL-log the change, and also set the page's LSN
>> to make sure the WAL record is flushed first.
>
> I'm not sure I fully understand this. If the page is flushed before the WAL
> record, we might have a truncated FSM whereas the relation truncation is
> not replayed. But that's not the same problem, right? I mean, you might
> have an FSM which is not accurate, but it won't at the least return the
> blocks outside the range. Having said that, I agree your proposed changes
> are more robust.

Actually, this is still not 100% safe. Flushing the WAL before modifying
the FSM page is not enough. We also need to WAL-log a full-page image of
the FSM page, otherwise we are still vulnerable to the torn page problem.

I came up with the attached. This is fortunately much simpler than my
previous attempt. I replaced the MarkBufferDirtyHint() calls with
MarkBufferDirty(), to fix the original issue, plus WAL-logging a full-page
image to fix the torn page issue.

> BTW any thoughts on race-condition on the primary? Comments at
> MarkBufferDirtyHint() seems to suggest that a race condition is possible
> which might leave the buffer without the DIRTY flag, but I'm not sure if
> that can only happen when the page is locked in shared mode.

I think the race condition can only happen when the page is locked in
shared mode. In any case, with this proposed fix, we'll use
MarkBufferDirty() rather than MarkBufferDirtyHint(), so it's moot.

- Heikki
Attachment
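For reference, the core of that approach might look roughly like the following sketch (simplified; the conditional around the full-page image is an assumption that it is only needed when page checksums or wal_log_hints are enabled, since a torn FSM page is otherwise merely inaccurate rather than a checksum failure; the attached patch is authoritative):

    /* Inside FreeSpaceMapTruncateRel(), with the FSM buffer exclusively locked */
    START_CRIT_SECTION();

    fsm_truncate_avail(BufferGetPage(buf), first_removed_slot);

    /* Dirty the buffer for real, not merely as a hint. */
    MarkBufferDirty(buf);

    /*
     * WAL-log a full-page image so that a torn FSM page cannot cause a
     * checksum failure if it reaches disk before the truncation record.
     */
    if (!InRecovery && RelationNeedsWAL(rel) && XLogHintBitIsNeeded())
        log_newpage_buffer(buf, false);

    END_CRIT_SECTION();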
On Wed, Oct 19, 2016 at 2:37 PM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
Actually, this is still not 100% safe. Flushing the WAL before modifying the FSM page is not enough. We also need to WAL-log a full-page image of the FSM page, otherwise we are still vulnerable to the torn page problem.
I came up with the attached. This is fortunately much simpler than my previous attempt. I replaced the MarkBufferDirtyHint() calls with MarkBufferDirty(), to fix the original issue, plus WAL-logging a full-page image to fix the torn page issue.
Looks good to me.
BTW any thoughts on race-condition on the primary? Comments at
MarkBufferDirtyHint() seems to suggest that a race condition is possible
which might leave the buffer without the DIRTY flag, but I'm not sure if
that can only happen when the page is locked in shared mode.
I think the race condition can only happen when the page is locked in shared mode. In any case, with this proposed fix, we'll use MarkBufferDirty() rather than MarkBufferDirtyHint(), so it's moot.
Yes, the fix will cover that problem (if it exists). The reason why I was curious to know is that there are several reports of a similar error in the past, and some of them did not involve a standby. Those reports mostly remained unresolved, and I wondered if this explains them. But yeah, my conclusion was that the race is not possible with the page locked in EXCLUSIVE mode. So maybe there is another problem somewhere, or crash recovery may have left the FSM in an inconsistent state.
Anyways, we seem good to go with the patch.
Thanks,
Pavan
-- Pavan Deolasee http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 10/19/2016 01:07 PM, Pavan Deolasee wrote:
> Anyways, we seem good to go with the patch.

Ok, committed. Thanks for the analysis!

- Heikki
On 10/19/2016 02:29 PM, Heikki Linnakangas wrote:
> On 10/19/2016 01:07 PM, Pavan Deolasee wrote:
>> Anyways, we seem good to go with the patch.
>
> Ok, committed. Thanks for the analysis!

Oh, forgot that this needs to be backported, of course. Will do that
shortly...

- Heikki
On Wed, Oct 19, 2016 at 8:29 PM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> On 10/19/2016 01:07 PM, Pavan Deolasee wrote:
>>
>> Anyways, we seem good to go with the patch.
>
> Ok, committed. Thanks for the analysis!

Thanks! I am surprised that you kept the TAP test at the end.
--
Michael
On 10/19/2016 02:32 PM, Heikki Linnakangas wrote:
> On 10/19/2016 02:29 PM, Heikki Linnakangas wrote:
>> On 10/19/2016 01:07 PM, Pavan Deolasee wrote:
>>> Anyways, we seem good to go with the patch.
>>
>> Ok, committed. Thanks for the analysis!
>
> Oh, forgot that this needs to be backported, of course. Will do that
> shortly...

Done.

This didn't include anything to cope with an already-corrupt FSM, BTW.
Do we still want to try something for that? I think it's good enough if
we prevent the FSM corruption from happening, but not sure what the
consensus on that might be..

- Heikki
Heikki Linnakangas <hlinnaka@iki.fi> writes:
> This didn't include anything to cope with an already-corrupt FSM, BTW.
> Do we still want to try something for that? I think it's good enough if
> we prevent the FSM corruption from happening, but not sure what the
> consensus on that might be..

Can we document an existing procedure for repairing FSM corruption?
(VACUUM, maybe?) We're going to need to be able to explain how to fix
busted visibility maps in 9.6.1, too.

			regards, tom lane
On Wed, Oct 19, 2016 at 6:44 PM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
On 10/19/2016 02:32 PM, Heikki Linnakangas wrote:
Oh, forgot that this needs to be backported, of course. Will do that
shortly...
Done.
Thanks!
This didn't include anything to cope with an already-corrupt FSM, BTW. Do we still want to try something for that? I think it's good enough if we prevent the FSM corruption from happening, but not sure what the consensus on that might be..
Thanks,
Pavan
Pavan Deolasee http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Wed, Oct 19, 2016 at 6:54 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Can we document an existing procedure for repairing FSM corruption?
(VACUUM, maybe?)
I'm afraid it may not be easy to repair the corruption with existing facilities. Most often the corruption will be on the standby, and a VACUUM may not actually touch the affected pages on the master (because they may not even exist on the master, or may be skipped because of the visibility map). It may not even trigger relation truncation. So AFAICS it may not generate any WAL activity that can fix the corruption on the standby.
One possible way would be to delete the FSM (and VM) information on the master and the standby and then run VACUUM so these maps are rebuilt. We obviously don't need to do this for all tables, but we need a way to find the tables with a corrupt FSM [1].
Suggested procedure could be:
1. Upgrade master and standby to the latest minor release (which involves restart)
2. Install the pg_freespacemap extension and run the query [1] on the master to find possible corruption. The query checks whether the FSM reports free space in a block beyond the current size of the relation. Unfortunately, we might get false positives if the relation is extended while the query is running.
3. Repeat the same query on standby (if it's running in Hot standby mode, otherwise the corruption can only be detected once it's promoted to be a master)
4. Remove the FSM and VM files for the affected tables (I don't think it's safe to do this on a running server)
5. VACUUM the affected tables so that the FSM and VM are rebuilt.
Another idea is to implement a pg_freespace_repair() function in pg_freespacemap which takes an AccessExclusiveLock on the table and truncates the FSM to the table's current size, thus generating a WAL record that the standby will replay to fix the corruption. This probably looks more promising, easier to explain, and less error prone.
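To make that alternative concrete, a hypothetical sketch of such a function follows (the name pg_freespace_repair and its shape are illustrative only; note that FreeSpaceMapTruncateRel() itself does not write WAL, so propagating the repair to standbys, which is the interesting part of the idea, would additionally require logging the truncation and is omitted here):

    #include "postgres.h"

    #include "access/heapam.h"
    #include "fmgr.h"
    #include "storage/bufmgr.h"
    #include "storage/freespace.h"
    #include "storage/lmgr.h"
    #include "utils/rel.h"

    PG_MODULE_MAGIC;

    PG_FUNCTION_INFO_V1(pg_freespace_repair);

    /*
     * Hypothetical sketch: truncate the FSM so that it cannot describe blocks
     * beyond the relation's current size.  WAL-logging of the truncation
     * (needed to repair standbys) is deliberately left out of this sketch.
     */
    Datum
    pg_freespace_repair(PG_FUNCTION_ARGS)
    {
        Oid         relid = PG_GETARG_OID(0);
        Relation    rel;

        /* Keep out concurrent inserters while the map is being repaired. */
        rel = heap_open(relid, AccessExclusiveLock);

        FreeSpaceMapTruncateRel(rel, RelationGetNumberOfBlocks(rel));

        heap_close(rel, AccessExclusiveLock);
        PG_RETURN_VOID();
    }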
[1]
SELECT *
FROM (
    SELECT oid::regclass as relname,
           EXISTS (
               SELECT *
               FROM (
                   SELECT blkno, pg_freespace(oid::regclass, blkno)
                   FROM generate_series(pg_relation_size(oid::regclass) / current_setting('block_size')::bigint,
                                        pg_relation_size(oid::regclass, 'fsm') / 2) as blkno
               ) as avail
               WHERE pg_freespace > 0
           ) as corrupt_fsm
    FROM pg_class
    WHERE relkind = 'r'
) b
WHERE b.corrupt_fsm = true;
Thanks,
Pavan
Pavan Deolasee http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Thu, Oct 20, 2016 at 2:11 PM, Pavan Deolasee <pavan.deolasee@gmail.com> wrote:
> 4. Remove FSM and VM files for the affected tables (I don't think if it's
> safe to do this on a running server)

Definitely not while the server is running... For VMs a good way would
be to use pg_visibility's pg_truncate_visibility_map(), but only for
9.6~. For FSM there is no real solution, and actually a
pg_truncate_fsm would prove to be useful here. So users would need to
stop the server once if there are corrupted VMs or FSMs, delete them
manually, and then run VACUUM to rebuild them.
--
Michael
On Thu, Oct 20, 2016 at 10:50 AM, Michael Paquier <michael.paquier@gmail.com> wrote:
For VMs a good way would
be to use pg_visibility's pg_truncate_visibility_map(), but only for
9.6~.
Ah ok..
For FSM there is no real solution, and actually a
pg_truncate_fsm would prove to be useful here.
Right, that's what I proposed as an alternative idea. I agree this is a much cleaner and less error-prone approach.
Actually, if we could add an API which can truncate the FSM to a given heap block, then the user may not even need to run VACUUM, which could be costly for very large tables. Also, AFAICS we will need to backport pg_truncate_visibility_map() to older releases, because unless the VM is truncated along with the FSM, VACUUM may not scan all pages and the FSM entries for those pages won't be recorded.
Thanks,
Pavan
Pavan Deolasee http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Thu, Oct 20, 2016 at 2:50 PM, Pavan Deolasee <pavan.deolasee@gmail.com> wrote:
> Actually, if we could add an API which can truncate FSM to the given heap
> block, then the user may not even need to run VACUUM, which could be costly
> for very large tables.

FreeSpaceMapTruncateRel()?

> Also, AFAICS we will need to backport
> pg_truncate_visibility_map() to older releases because unless the VM is
> truncated along with the FSM, VACUUM may not scan all pages and the FSM for
> those pages won't be recorded.

This would need a careful lookup, and it would not be that complicated
to implement. And this bug justifies the presence of a tool like that
actually... But usually those don't get a backpatch, so the probability
of getting that backported is low IMO, personally I am not sure I like
that either.
--
Michael
On Thu, Oct 20, 2016 at 11:34 AM, Michael Paquier <michael.paquier@gmail.com> wrote:
On Thu, Oct 20, 2016 at 2:50 PM, Pavan Deolasee
<pavan.deolasee@gmail.com> wrote:
> Actually, if we could add an API which can truncate FSM to the given heap
> block, then the user may not even need to run VACUUM, which could be costly
> for very large tables.
FreeSpaceMapTruncateRel()?
Right. We need a SQL callable function to invoke that.
> Also, AFAICS we will need to backport
> pg_truncate_visibility_map() to older releases because unless the VM is
> truncated along with the FSM, VACUUM may not scan all pages and the FSM for
> those pages won't be recorded.
This would need a careful lookup, and it would not be that complicated
to implement. And this bug justifies the presence of a tool like that
actually... But usually those don't get a backpatch, so the
probability of getting that backported is low IMO, personally I am not
sure I like that either.
Just to clarify, I meant that if we truncate the entire FSM then we'll need an API to truncate the VM as well, so that VACUUM rebuilds everything completely. OTOH, if we provide a function just to truncate the FSM to match the size of the table, then we don't need to rebuild the FSM. So that's probably a better way to handle FSM corruption, as far as this particular issue is concerned.
Thanks,
Pavan
Pavan Deolasee http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Thu, Oct 20, 2016 at 3:37 PM, Pavan Deolasee <pavan.deolasee@gmail.com> wrote:
> Just to clarify, I meant if we truncate the entire FSM then we'll need API
> to truncate VM as well so that VACUUM rebuilds everything completely. OTOH
> if we provide a function just to truncate FSM to match the size of the
> table, then we don't need to rebuild the FSM. So that's probably a better
> way to handle FSM corruption, as far as this particular issue is concerned.

To be honest, I think that just having in the release notes a method that
does not involve the use of any extra extension or SQL routine is fine. So
we could just tell users to:
1) Run something like the query you gave upthread, giving the user a list
of the files that are corrupted, and add this query to the release notes.
2) If anything is found, stop the server and delete the files manually.
3) Re-start the server.
OK, that's troublesome and costly for large relations, but we know
that's the safest way to go for any versions, and there is no need to
complicate the code with any one-time repairing extensions.

Speaking of which, I implemented a small extension able to truncate
the FSM up to the size of the relation as attached, but as I looked at
it SMGR_TRUNCATE_FSM has been introduced in 9.6 so its range of action
is rather limited... And I pushed as well a version on github:
https://github.com/michaelpq/pg_plugins/tree/master/pg_fix_truncation
The limited range of such an extension is an argument good enough to
just rely on the stop/delete-FSM/start method to fix an instance and
let VACUUM do the rest of the work. That looks to work but use it at
your own risk.

This bug would be a good blog topic by the way...
--
Michael
Attachment
On 10/20/16 10:15 PM, Michael Paquier wrote:
> 2) If anything is found, stop the server and delete the files manually.
> 3) Re-start the server.
> OK, that's troublesome and costly for large relations, but we know
> that's the safest way to go for any versions, and there is no need to
> complicate the code with any one-time repairing extensions.

Wouldn't you need to run around to all your replicas and do that as well?

> Speaking of which, I implemented a small extension able to truncate
> the FSM up to the size of the relation as attached, but as I looked at
> it SMGR_TRUNCATE_FSM has been introduced in 9.6 so its range of action
> is rather limited... And I pushed as well a version on github:
> https://github.com/michaelpq/pg_plugins/tree/master/pg_fix_truncation
> The limited range of such an extension is an argument good enough to
> just rely on the stop/delete-FSM/start method to fix an instance and
> let VACUUM do the rest of the work. That looks to work but use it at
> your own risk.

Shouldn't the truncation be logged before it's performed?
--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com
855-TREBLE2 (855-873-2532)   mobile: 512-569-9461
On Sat, Oct 22, 2016 at 5:17 AM, Jim Nasby <Jim.Nasby@bluetreble.com> wrote:
> On 10/20/16 10:15 PM, Michael Paquier wrote:
>>
>> 2) If anything is found, stop the server and delete the files manually.
>> 3) Re-start the server.
>> OK, that's troublesome and costly for large relations, but we know
>> that's the safest way to go for any versions, and there is no need to
>> complicate the code with any one-time repairing extensions.
>
> Wouldn't you need to run around to all your replicas and do that as well?

Yeah...

>> Speaking of which, I implemented a small extension able to truncate
>> the FSM up to the size of the relation as attached, but as I looked at
>> it SMGR_TRUNCATE_FSM has been introduced in 9.6 so its range of action
>> is rather limited... And I pushed as well a version on github:
>> https://github.com/michaelpq/pg_plugins/tree/master/pg_fix_truncation
>> The limited range of such an extension is an argument good enough to
>> just rely on the stop/delete-FSM/start method to fix an instance and
>> let VACUUM do the rest of the work. That looks to work but use it at
>> your own risk.
>
> Shouldn't the truncation be logged before it's performed?

Doh. Yes, thanks for the reminder. And I commented on that upthread..
--
Michael
On Sat, Oct 22, 2016 at 7:31 AM, Michael Paquier <michael.paquier@gmail.com> wrote:
> On Sat, Oct 22, 2016 at 5:17 AM, Jim Nasby <Jim.Nasby@bluetreble.com> wrote:
>> On 10/20/16 10:15 PM, Michael Paquier wrote:
>>>
>>> 2) If anything is found, stop the server and delete the files manually.
>>> 3) Re-start the server.
>>> OK, that's troublesome and costly for large relations, but we know
>>> that's the safest way to go for any versions, and there is no need to
>>> complicate the code with any one-time repairing extensions.
>>
>> Wouldn't you need to run around to all your replicas and do that as well?
>
> Yeah...

Release notes refer to this page for methods to fix corrupted instances:
https://wiki.postgresql.org/wiki/Free_Space_Map_Problems
I just typed something based on Pavan's upper method, feel free to
jump in and improve it as necessary.
--
Michael
I wrote:
> It looks to me like this is approximating the highest block number that
> could possibly have an FSM entry as size of the FSM fork (in bytes)
> divided by 2. But the FSM stores one byte per block. There is overhead
> for the FSM search tree, but in a large relation it's not going to be as
> much as a factor of 2. So I think that to be conservative we need to
> drop the "/ 2". Am I missing something?

Ah, scratch that, after rereading the FSM README I see it's correct,
because there's a binary tree within each page; I'd only remembered
that there was a search tree of pages.

Also, we could at least discount the FSM root page and first intermediate
page, no? That is, the upper limit could be

    pg_relation_size(oid::regclass, 'fsm') / 2 - 2*current_setting('block_size')::BIGINT

I think this is a worthwhile improvement because it reduces the time spent
on small relations. For me, the query as given takes 9 seconds to examine
the regression database, which seems like a lot. Discounting two pages
reduces that to 20 ms.

			regards, tom lane
On Mon, Oct 24, 2016 at 9:34 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
SELECT blkno, pg_freespace(oid::regclass, blkno)
FROM generate_series(pg_relation_size(oid::regclass) / current_setting('block_size')::BIGINT,
                     pg_relation_size(oid::regclass, 'fsm') / 2) AS blkno
It looks to me like this is approximating the highest block number that
could possibly have an FSM entry as size of the FSM fork (in bytes)
divided by 2. But the FSM stores one byte per block. There is overhead
for the FSM search tree, but in a large relation it's not going to be as
much as a factor of 2. So I think that to be conservative we need to
drop the "/ 2". Am I missing something?
I went by these comments in fsm_internals.h, which suggest that SlotsPerFSMPage is limited to somewhat less than BLCKSZ divided by 2:
/*
* Number of non-leaf and leaf nodes, and nodes in total, on an FSM page.
* These definitions are internal to fsmpage.c.
*/
#define NodesPerPage (BLCKSZ - MAXALIGN(SizeOfPageHeaderData) - \
                      offsetof(FSMPageData, fp_nodes))
#define NonLeafNodesPerPage (BLCKSZ / 2 - 1)
#define LeafNodesPerPage (NodesPerPage - NonLeafNodesPerPage)
/*
* Number of FSM "slots" on a FSM page. This is what should be used
* outside fsmpage.c.
*/
#define SlotsPerFSMPage LeafNodesPerPage
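For concreteness, plugging the default 8 kB block size into those definitions gives roughly the following (back-of-the-envelope arithmetic, assuming the usual 24-byte page header and a 4-byte fp_next_slot):

    /*
     * Rough arithmetic for BLCKSZ = 8192:
     *
     *   NodesPerPage        = 8192 - 24 - 4 = 8164
     *   NonLeafNodesPerPage = 8192 / 2 - 1  = 4095
     *   LeafNodesPerPage    = 8164 - 4095   = 4069
     *
     * So SlotsPerFSMPage comes out a bit under BLCKSZ / 2, which is why
     * dividing the FSM fork size (in bytes) by 2 is a safe, slightly
     * conservative upper bound on the heap blocks it can describe.
     */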
Thanks,
Pavan
Pavan Deolasee http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Michael Paquier <michael.paquier@gmail.com> writes:
> Release notes refer to this page for methods to fix corrupted instances:
> https://wiki.postgresql.org/wiki/Free_Space_Map_Problems
> I just typed something based on Pavan's upper method, feel free to
> jump in and improve it as necessary.

Thanks for drafting that, but isn't the query wrong? Specifically the bit about

    SELECT blkno, pg_freespace(oid::regclass, blkno)
    FROM generate_series(pg_relation_size(oid::regclass) / current_setting('block_size')::BIGINT,
                         pg_relation_size(oid::regclass, 'fsm') / 2) AS blkno

It looks to me like this is approximating the highest block number that
could possibly have an FSM entry as size of the FSM fork (in bytes)
divided by 2. But the FSM stores one byte per block. There is overhead
for the FSM search tree, but in a large relation it's not going to be as
much as a factor of 2. So I think that to be conservative we need to
drop the "/ 2". Am I missing something?

			regards, tom lane
On Mon, Oct 24, 2016 at 9:47 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Also, we could at least discount the FSM root page and first intermediate
page, no? That is, the upper limit could be
pg_relation_size(oid::regclass, 'fsm') / 2 - 2*current_setting('block_size')::BIGINT
+1 for doing that.
Thanks,
Pavan
Pavan Deolasee http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Tom Lane wrote:
> Ah, scratch that, after rereading the FSM README I see it's correct,
> because there's a binary tree within each page; I'd only remembered
> that there was a search tree of pages.
>
> Also, we could at least discount the FSM root page and first intermediate
> page, no? That is, the upper limit could be
>
>     pg_relation_size(oid::regclass, 'fsm') / 2 - 2*current_setting('block_size')::BIGINT
>
> I think this is a worthwhile improvement because it reduces the time spent
> on small relations. For me, the query as given takes 9 seconds to examine
> the regression database, which seems like a lot. Discounting two pages
> reduces that to 20 ms.

Hah, good one. We spent some time thinking about subtracting some value
to make the value more accurate but it didn't occur to me to just use
constant two.
--
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Alvaro Herrera <alvherre@2ndquadrant.com> writes:
> Tom Lane wrote:
>> Also, we could at least discount the FSM root page and first intermediate
>> page, no? That is, the upper limit could be
>>
>>     pg_relation_size(oid::regclass, 'fsm') / 2 - 2*current_setting('block_size')::BIGINT
>>
>> I think this is a worthwhile improvement because it reduces the time spent
>> on small relations. For me, the query as given takes 9 seconds to examine
>> the regression database, which seems like a lot. Discounting two pages
>> reduces that to 20 ms.

> Hah, good one. We spent some time thinking about subtracting some value
> to make the value more accurate but it didn't occur to me to just use
> constant two.

I got the arithmetic wrong in the above, it should be like

    (pg_relation_size(oid::regclass, 'fsm') - 2*current_setting('block_size')::BIGINT) / 2

With that, the runtime on HEAD's regression DB is about 700 ms, which is
still a nice win over 9000 ms.

I've put up draft wiki pages about these two problems at
https://wiki.postgresql.org/wiki/Free_Space_Map_Problems
https://wiki.postgresql.org/wiki/Visibility_Map_Problems
(thanks to Michael Paquier for initial work on the first one).
They're meant to be reasonably generic about FSM/VM problems rather
than only being about our current bugs. Please review.

			regards, tom lane