Thread: FSM corruption leading to errors
I investigated a bug report from one of our customers and it looked very similar to previous bug reports here [1], [2], [3] (and probably more). In these reports, the error looks something like this:
ERROR: could not read block 28991 in file "base/16390/572026": read only 0 of 8192 bytes
I traced it to the following code in MarkBufferDirtyHint(). The function returns without setting the DIRTY bit on the standby:
	/*
	 * If we're in recovery we cannot dirty a page because of a hint.
	 * We can set the hint, just not dirty the page as a result so the
	 * hint is lost when we evict the page or shutdown.
	 *
	 * See src/backend/storage/page/README for longer discussion.
	 */
	if (RecoveryInProgress())
		return;
freespace.c freely uses MarkBufferDirtyHint() whenever changes are made to the FSM. I think that's usually all right, because FSM changes are not WAL-logged, and if the FSM ever returns a block with less free space than the caller needs, the caller is usually prepared to update the FSM and request a new block. But if it returns a block that is outside the size of the relation, then we're in trouble: the very next ReadBuffer() fails to handle such a block and throws the error above.
When a relation is truncated, the FSM is truncated too, to remove references to the heap blocks that are being truncated away. But since the FSM buffer may not be marked dirty on the standby, if the buffer gets evicted from the buffer cache, the on-disk copy of the FSM page may be left with references to the truncated heap pages. When the standby is later promoted to be the master and an insert/update is attempted on the table, the FSM may return a block that is outside the valid range of the relation. That results in the said error.
Once this was clear, it was easy to put together a fully reproducible test case. See the attached script; you'll need to adjust it to your environment. This affects all releases starting with 9.3, and the script can reproduce the problem on all of them.
I believe the fix is very simple: the FSM change during truncation is critical, and the buffer must be marked with MarkBufferDirty(), i.e. those changes must make it to disk. I think it's all right not to WAL-log them, because XLOG_SMGR_TRUNCATE will redo them if a crash occurs, but they must not be lost across a checkpoint. Also, since this happens only during relation truncation, I don't see any problem from a performance perspective.
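For illustration, a minimal sketch of that change inside FreeSpaceMapTruncateRel() (simplified and abbreviated; the messages below refine this further):

    /*
     * Sketch only: simplified excerpt of FreeSpaceMapTruncateRel().  Wipe
     * the FSM entries for the truncated-away heap blocks, then dirty the
     * buffer for real so the change survives eviction, checkpoints and
     * recovery on a standby.
     */
    buf = fsm_readbuf(rel, first_removed_address, false);
    if (BufferIsValid(buf))
    {
        LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
        fsm_truncate_avail(BufferGetPage(buf), first_removed_slot);

        /* was: MarkBufferDirtyHint(buf, false), which is a no-op during recovery */
        MarkBufferDirty(buf);

        UnlockReleaseBuffer(buf);
    }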
What bothers me is how to fix the problem for already-affected standbys. If the FSM for some table is already corrupted on the standby, users won't notice it until the standby is promoted to be the new master. If the standby suddenly starts throwing errors after failover, that is a very bad situation for the users, as we saw with our customers. The fix is simple, and users can just delete the FSM (and VACUUM the table), but that doesn't sound nice, and they would not know to do it until they see the problem.
One idea is to always check whether the block returned by the FSM is outside the range, and discard such blocks after updating the FSM (the attached patch does that). The problem with that approach is that RelationGetNumberOfBlocks() is not really cheap, and invoking it every time the FSM is consulted may not be a bright idea. Can we cache that value in the RelationData or some such place (BulkInsertState?) and use it as a hint for quickly checking whether a block is (potentially) outside the range, and discard it? Any other ideas?
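As a rough illustration of that caching idea (a hypothetical helper, not an existing PostgreSQL API; in a real patch the cached value would live in RelationData or BulkInsertState rather than a static variable):

    #include "postgres.h"
    #include "storage/block.h"
    #include "storage/bufmgr.h"
    #include "utils/rel.h"

    /*
     * Hypothetical sketch: accept a block returned by the FSM only if it lies
     * below a cached copy of the relation size, and call the comparatively
     * expensive RelationGetNumberOfBlocks() only when the candidate block is
     * at or beyond the last size we saw.
     */
    static BlockNumber cached_nblocks = InvalidBlockNumber;

    static bool
    fsm_block_plausible(Relation rel, BlockNumber target_block)
    {
        if (target_block == InvalidBlockNumber)
            return false;

        /* Fast path: block is below the relation size we last observed. */
        if (cached_nblocks != InvalidBlockNumber && target_block < cached_nblocks)
            return true;

        /* Slow path: refresh the cached size and re-check. */
        cached_nblocks = RelationGetNumberOfBlocks(rel);
        return target_block < cached_nblocks;
    }

A caller would discard (and reset the FSM entry for) any block for which this returns false, falling back to extending the relation.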
The other concern I have (and, TBH, that's what I initially thought was the real problem, until I saw the RecoveryInProgress()-specific code) is: can this also affect stand-alone masters? The comments at MarkBufferDirtyHint() made me think so:
 * 3. This function does not guarantee that the buffer is always marked dirty
 *    (due to a race condition), so it cannot be used for important changes.
So I was working with a theory that updates to the FSM page are somehow lost because the race mentioned in the comment actually kicks in. But I'm not sure whether the race is only possible when the caller is holding a SHARE lock on the buffer. When the FSM is truncated, the caller holds an EXCLUSIVE lock on the FSM buffer, so we're probably safe. I could not reproduce the issue on a stand-alone master, but it's probably worth checking.
It might also be a good idea to inspect other callers of MarkBufferDirtyHint() and see if any of them are vulnerable, especially from the standby perspective. I did one round and couldn't see another problem.
Thanks,
Pavan
Pavan Deolasee http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Attachment
Could you please add the patches to commitfest?
I'm going to test them and write a review in a few days.
--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
On Fri, Oct 7, 2016 at 2:59 AM, Pavan Deolasee <pavan.deolasee@gmail.com> wrote:
> I investigated a bug report from one of our customers and it looked very
> similar to previous bug reports here [1], [2], [3] (and probably more). In
> these reports, the error looks something like this:
>
> ERROR: could not read block 28991 in file "base/16390/572026": read only 0
> of 8192 bytes
>
> I traced it to the following code in MarkBufferDirtyHint().
>
> freespace.c freely uses MarkBufferDirtyHint() whenever changes are made to
> the FSM. I think that's usually alright because FSM changes are not WAL
> logged and if FSM ever returns a block with less free space than the caller
> needs, the caller is usually prepared to update the FSM and request for a
> new block. But if it returns a block that is outside the size of the
> relation, then we've a trouble. The very next ReadBuffer() fails to handle
> such a block and throws the error.

To be honest, I have been seeing a very similar issue for a couple of
weeks now on some test beds, on a couple of relations involving a promoted
standby, this error happening more frequently for relations having a more
aggressive autovacuum setup. I did not take the time to look at that
seriously, and I am very glad to see this email.

> When a relation is truncated, the FSM is truncated too to remove references
> to the heap blocks that are being truncated. But since the FSM buffer may
> not be marked DIRTY on the standby, if the buffer gets evicted from the
> buffer cache, the on-disk copy of the FSM page may be left with references
> to the truncated heap pages. When the standby is later promoted to be the
> master, and an insert/update is attempted to the table, the FSM may return a
> block that is outside the valid range of the relation. That results in the
> said error.

Oops.

> I believe the fix is very simple. The FSM change during truncation is
> critical and the buffer must be marked by MarkBufferDirty() i.e. those
> changes must make to the disk. I think it's alright not to WAL log them
> because XLOG_SMGR_TRUNCATE will redo() them if a crash occurs. But it must
> not be lost across a checkpoint. Also, since it happens only during relation
> truncation, I don't see any problem from performance perspective.

Agreed. I happen to notice that VM is similarly careful when it comes
to truncate it (visibilitymap_truncate).

> What bothers me is how to fix the problem for already affected standbys. If
> the FSM for some table is already corrupted at the standby, users won't
> notice it until the standby is promoted to be the new master. If the standby
> starts throwing errors suddenly after failover, it will be a very bad
> situation for the users, like we noticed with our customers. The fix is
> simple and users can just delete the FSM (and VACUUM the table), but that
> doesn't sound nice and they would not know until they see the problem.

+   /*
+    * See comments in GetPageWithFreeSpace about handling outside the valid
+    * range blocks
+    */
+   nblocks = RelationGetNumberOfBlocks(rel);
+   while (target_block >= nblocks && target_block != InvalidBlockNumber)
+   {
+       target_block = RecordAndGetPageWithFreeSpace(rel, target_block, 0,
+                                                    spaceNeeded);
+   }

Hm. This is just a workaround. Even if things are done this way the
FSM will remain corrupted. And isn't that going to break once the
relation is extended again?

I'd suggest instead putting in the release notes a query that allows one
to analyze what are the relations broken and directly have them fixed.
That's annoying, but it would be really better than a workaround. One idea
here is to use pg_freespace() and see if it returns a non-zero value for
an out-of-range block on a standby.

I have also been thinking about the comment you added, and we could just
do something like that:

+   /*
+    * Mark the buffer still holding references to truncated blocks as
+    * dirty to be sure that this makes it to disk and is kept consistent
+    * with the truncated relation.
+    */
+   MarkBufferDirty(buf)

> It might also be a good idea to inspect other callers of
> MarkBufferDirtyHint() and see if any of them is vulnerable, especially from
> standby perspective. I did one round, and couldn't see another problem.

Haven't looked at those yet, will do so tomorrow.

At the same time, I have translated your script into a TAP test, I
found that more useful when testing..
--
Michael
On Mon, Oct 10, 2016 at 11:25 PM, Michael Paquier <michael.paquier@gmail.com> wrote:
> At the same time, I have translated your script into a TAP test, I
> found that more useful when testing..

Well... Here is the actual patch.
--
Michael
Attachment
On Fri, Oct 7, 2016 at 11:50 PM, Anastasia Lubennikova <a.lubennikova@postgrespro.ru> wrote:
> Could you please add the patches to commitfest?
> I'm going to test them and write a review in a few days.

Here you go: https://commitfest.postgresql.org/11/817/
--
Michael
On Mon, Oct 10, 2016 at 7:55 PM, Michael Paquier <michael.paquier@gmail.com> wrote:
+ /*
+ * See comments in GetPageWithFreeSpace about handling outside the valid
+ * range blocks
+ */
+ nblocks = RelationGetNumberOfBlocks(rel);
+ while (target_block >= nblocks && target_block != InvalidBlockNumber)
+ {
+ target_block = RecordAndGetPageWithFreeSpace(rel, target_block, 0,
+ spaceNeeded);
+ }
Hm. This is just a workaround. Even if things are done this way the
FSM will remain corrupted.
No, because the code above updates the FSM for those out-of-range blocks. But now that I look at it again, maybe this is not correct, and it may get into an endless loop if the relation is repeatedly extended concurrently.
And isn't that going to break once the
relation is extended again?
Once the underlying bug is fixed, I don't see why it should break again. I added the above code mostly to deal with already-corrupt FSMs. Maybe we can just document this and leave it to the user to run some correctness checks (see below), especially given that the code is not cheap and adds overhead for everybody, irrespective of whether they have, or will ever have, a corrupt FSM.
I'd suggest instead putting in the release
notes a query that allows one to analyze what are the relations broken
and directly have them fixed. That's annoying, but it would be really
better than a workaround. One idea here is to use pg_freespace() and
see if it returns a non-zero value for an out-of-range block on a
standby.
Right, that's how I tested for broken FSMs. A challenge with any such query is that if the shared-buffer copy of the FSM page is intact, then the query won't return the problematic FSMs. Of course, if the fix is applied to the standby and it is restarted, then corrupt FSMs can be detected.
At the same time, I have translated your script into a TAP test, I
found that more useful when testing..
Thanks for doing that.
Thanks,
Pavan
--
Pavan Deolasee http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Mon, Oct 10, 2016 at 11:41 PM, Pavan Deolasee <pavan.deolasee@gmail.com> wrote:
> On Mon, Oct 10, 2016 at 7:55 PM, Michael Paquier <michael.paquier@gmail.com> wrote:
>>
>> + /*
>> + * See comments in GetPageWithFreeSpace about handling outside the valid
>> + * range blocks
>> + */
>> + nblocks = RelationGetNumberOfBlocks(rel);
>> + while (target_block >= nblocks && target_block != InvalidBlockNumber)
>> + {
>> + target_block = RecordAndGetPageWithFreeSpace(rel, target_block, 0,
>> + spaceNeeded);
>> + }
>> Hm. This is just a workaround. Even if things are done this way the
>> FSM will remain corrupted.
>
> No, because the code above updates the FSM of those out-of-the range blocks.
> But now that I look at it again, may be this is not correct and it may get
> into an endless loop if the relation is repeatedly extended concurrently.

Ah yes, that's what the call for
RecordAndGetPageWithFreeSpace()/fsm_set_and_search() is for. I missed
that yesterday before sleeping.

>> And isn't that going to break once the
>> relation is extended again?
>
> Once the underlying bug is fixed, I don't see why it should break again. I
> added the above code to mostly deal with already corrupt FSMs. May be we can
> just document and leave it to the user to run some correctness checks (see
> below), especially given that the code is not cheap and adds overheads for
> everybody, irrespective of whether they have or will ever have corrupt FSM.

Yep. I'd leave it for the release notes to hold a diagnostic method.
That's annoying, but this has been done in the past like for the
multixact issues..

>> I'd suggest instead putting in the release
>> notes a query that allows one to analyze what are the relations broken
>> and directly have them fixed. That's annoying, but it would be really
>> better than a workaround. One idea here is to use pg_freespace() and
>> see if it returns a non-zero value for an out-of-range block on a
>> standby.
>
> Right, that's how I tested for broken FSMs. A challenge with any such query
> is that if the shared buffer copy of the FSM page is intact, then the query
> won't return problematic FSMs. Of course, if the fix is applied to the
> standby and is restarted, then corrupt FSMs can be detected.

What if you restart the standby, and then do a diagnostic query?
Wouldn't that be enough? (Something just based on
pg_freespace(pg_relation_size(oid) / block_size) != 0)
--
Michael
On Tue, Oct 11, 2016 at 5:20 AM, Michael Paquier <michael.paquier@gmail.com> wrote:
>
> Once the underlying bug is fixed, I don't see why it should break again. I
> added the above code to mostly deal with already corrupt FSMs. May be we can
> just document and leave it to the user to run some correctness checks (see
> below), especially given that the code is not cheap and adds overheads for
> everybody, irrespective of whether they have or will ever have corrupt FSM.
Yep. I'd leave it for the release notes to hold a diagnostic method.
That's annoying, but this has been done in the past like for the
multixact issues..
I'm okay with that. It's annoying, especially because the bug may show up when your primary is down and you have just failed over for HA, only to find that the standby won't work correctly. But I don't have any ideas for fixing existing corruption without adding a significant penalty to the normal path.
What if you restart the standby, and then do a diagnostic query?
Wouldn't that be enough? (Something just based on
pg_freespace(pg_relation_size(oid) / block_size) != 0)
I think this is a major bug and I would appreciate any ideas to get the patch in a committable shape before the next minor release goes out. We probably need a committer to get interested in this to make progress.
Thanks,
Pavan
Pavan Deolasee http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Mon, Oct 17, 2016 at 4:14 PM, Pavan Deolasee <pavan.deolasee@gmail.com> wrote:
> I think this is a major bug and I would appreciate any ideas to get the
> patch in a committable shape before the next minor release goes out. We
> probably need a committer to get interested in this to make progress.

Same opinion here, this is very annoying for some of my internal
users.. I don't have more to offer though.
--
Michael
On 10/10/2016 05:25 PM, Michael Paquier wrote:
> On Fri, Oct 7, 2016 at 2:59 AM, Pavan Deolasee <pavan.deolasee@gmail.com> wrote:
>> I believe the fix is very simple. The FSM change during truncation is
>> critical and the buffer must be marked by MarkBufferDirty() i.e. those
>> changes must make to the disk. I think it's alright not to WAL log them
>> because XLOG_SMGR_TRUNCATE will redo() them if a crash occurs. But it must
>> not be lost across a checkpoint. Also, since it happens only during relation
>> truncation, I don't see any problem from performance perspective.
>
> Agreed. I happen to notice that VM is similarly careful when it comes
> to truncate it (visibilitymap_truncate).

visibilitymap_truncate is actually also wrong, in a different way. The
truncation WAL record is written only after the VM (and FSM) are truncated.
But visibilitymap_truncate() has already modified and dirtied the page. If
the VM page change is flushed to disk before the WAL record, and you crash,
you might have a torn VM page and a checksum failure.

Simply replacing the MarkBufferDirtyHint() call with MarkBufferDirty() in
FreeSpaceMapTruncateRel would have the same issue. If you call
MarkBufferDirty(), you must WAL-log the change, and also set the page's LSN
to make sure the WAL record is flushed first.

I think we need something like the attached.

- Heikki
Attachment
On Mon, Oct 17, 2016 at 8:04 PM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> visibilitymap_truncate is actually also wrong, in a different way. The
> truncation WAL record is written only after the VM (and FSM) are truncated.
> But visibilitymap_truncate() has already modified and dirtied the page. If
> the VM page change is flushed to disk before the WAL record, and you crash,
> you might have a torn VM page and a checksum failure.

Good point! I didn't think about that.

> Simply replacing the MarkBufferDirtyHint() call with MarkBufferDirty() in
> FreeSpaceMapTruncateRel would have the same issue. If you call
> MarkBufferDirty(), you must WAL-log the change, and also set the page's LSN
> to make sure the WAL record is flushed first.

Right.. All the code paths calling FreeSpaceMapTruncateRel or
visibilitymap_truncate do flush the record, but it definitely makes things
more robust to set the LSN on the page. So +1 this way.

> I think we need something like the attached.

OK, I had a look at that.

     MarkBufferDirty(mapBuffer);
+    if (lsn != InvalidXLogRecPtr)
+        PageSetLSN(page, lsn);
     UnlockReleaseBuffer(mapBuffer);

Nit: using XLogRecPtrIsInvalid() makes more sense for such checks?

pg_visibility calls visibilitymap_truncate as well, but this patch did not
update it. And it would be better to do the actual truncate after writing
the WAL record, I think.

You are also breaking XLogFlush() in RelationTruncate(), because vm and fsm
are assigned thanks to a call of smgrexists(), something done before
XLogFlush is called on HEAD, and after it with your patch. So your patch
finishes by never calling XLogFlush() on relation truncation even if there
is a VM or a FSM.

Attached is an updated version with those issues fixed. The final version
does not need the test as that's really a corner case, but I still included
it to help testing for now.
--
Michael
Attachment
On Mon, Oct 17, 2016 at 4:34 PM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
visibilitymap_truncate is actually also wrong, in a different way. The truncation WAL record is written only after the VM (and FSM) are truncated. But visibilitymap_truncate() has already modified and dirtied the page. If the VM page change is flushed to disk before the WAL record, and you crash, you might have a torn VM page and a checksum failure.
Right, I missed the problem with checksums.
Simply replacing the MarkBufferDirtyHint() call with MarkBufferDirty() in FreeSpaceMapTruncateRel would have the same issue. If you call MarkBufferDirty(), you must WAL-log the change, and also set the page's LSN to make sure the WAL record is flushed first.
I'm not sure I fully understand this. If the page is flushed before the WAL record, we might have a truncated FSM whereas the relation truncation is not replayed. But that's not the same problem, right? I mean, you might have an FSM which is not accurate, but it won't at the least return the blocks outside the range. Having said that, I agree your proposed changes are more robust.
BTW any thoughts on race-condition on the primary? Comments at MarkBufferDirtyHint() seems to suggest that a race condition is possible which might leave the buffer without the DIRTY flag, but I'm not sure if that can only happen when the page is locked in shared mode. While most of the reports so far involved standbys, and the bug can also hit a standalone master during crash recovery, I wonder if a race can occur even on a live system.
Thanks,
Pavan
Pavan Deolasee http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 10/18/2016 07:01 AM, Pavan Deolasee wrote:
> On Mon, Oct 17, 2016 at 4:34 PM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>
>> visibilitymap_truncate is actually also wrong, in a different way. The
>> truncation WAL record is written only after the VM (and FSM) are truncated.
>> But visibilitymap_truncate() has already modified and dirtied the page. If
>> the VM page change is flushed to disk before the WAL record, and you crash,
>> you might have a torn VM page and a checksum failure.
>
> Right, I missed the problem with checksums.
>
>> Simply replacing the MarkBufferDirtyHint() call with MarkBufferDirty() in
>> FreeSpaceMapTruncateRel would have the same issue. If you call
>> MarkBufferDirty(), you must WAL-log the change, and also set the page's LSN
>> to make sure the WAL record is flushed first.
>
> I'm not sure I fully understand this. If the page is flushed before the WAL
> record, we might have a truncated FSM whereas the relation truncation is
> not replayed. But that's not the same problem, right? I mean, you might
> have an FSM which is not accurate, but it won't at the least return the
> blocks outside the range. Having said that, I agree your proposed changes
> are more robust.

Actually, this is still not 100% safe. Flushing the WAL before modifying
the FSM page is not enough. We also need to WAL-log a full-page image of
the FSM page, otherwise we are still vulnerable to the torn page problem.

I came up with the attached. This is fortunately much simpler than my
previous attempt. I replaced the MarkBufferDirtyHint() calls with
MarkBufferDirty(), to fix the original issue, plus WAL-logging a full-page
image to fix the torn page issue.

> BTW any thoughts on race-condition on the primary? Comments at
> MarkBufferDirtyHint() seems to suggest that a race condition is possible
> which might leave the buffer without the DIRTY flag, but I'm not sure if
> that can only happen when the page is locked in shared mode.

I think the race condition can only happen when the page is locked in
shared mode. In any case, with this proposed fix, we'll use
MarkBufferDirty() rather than MarkBufferDirtyHint(), so it's moot.

- Heikki
Attachment
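For reference, the core of that approach might look roughly like the following sketch (simplified; the conditional around the full-page image is an assumption that it is only needed when page checksums or wal_log_hints are enabled, since a torn FSM page is otherwise merely inaccurate rather than a checksum failure; the attached patch is authoritative):

    /* Inside FreeSpaceMapTruncateRel(), with the FSM buffer exclusively locked */
    START_CRIT_SECTION();

    fsm_truncate_avail(BufferGetPage(buf), first_removed_slot);

    /* Dirty the buffer for real, not merely as a hint. */
    MarkBufferDirty(buf);

    /*
     * WAL-log a full-page image so that a torn FSM page cannot cause a
     * checksum failure if it reaches disk before the truncation record.
     */
    if (!InRecovery && RelationNeedsWAL(rel) && XLogHintBitIsNeeded())
        log_newpage_buffer(buf, false);

    END_CRIT_SECTION();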
On Wed, Oct 19, 2016 at 2:37 PM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
Actually, this is still not 100% safe. Flushing the WAL before modifying the FSM page is not enough. We also need to WAL-log a full-page image of the FSM page, otherwise we are still vulnerable to the torn page problem.
I came up with the attached. This is fortunately much simpler than my previous attempt. I replaced the MarkBufferDirtyHint() calls with MarkBufferDirty(), to fix the original issue, plus WAL-logging a full-page image to fix the torn page issue.
Looks good to me.
BTW any thoughts on race-condition on the primary? Comments at
MarkBufferDirtyHint() seems to suggest that a race condition is possible
which might leave the buffer without the DIRTY flag, but I'm not sure if
that can only happen when the page is locked in shared mode.
I think the race condition can only happen when the page is locked in shared mode. In any case, with this proposed fix, we'll use MarkBufferDirty() rather than MarkBufferDirtyHint(), so it's moot.
Yes, the fix will cover that problem (if it exists). The reason why I was curious to know is that there are several reports of a similar error in the past, and some of them did not involve a standby. Those reports mostly remained unresolved, and I wondered if this explains them. But yeah, my conclusion was that the race is not possible with the page locked in EXCLUSIVE mode. So maybe there is another problem somewhere, or crash recovery may have left the FSM in an inconsistent state.
Anyways, we seem good to go with the patch.
Thanks,
Pavan
-- Pavan Deolasee http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 10/19/2016 01:07 PM, Pavan Deolasee wrote:
> Anyways, we seem good to go with the patch.

Ok, committed. Thanks for the analysis!

- Heikki
On 10/19/2016 02:29 PM, Heikki Linnakangas wrote:
> On 10/19/2016 01:07 PM, Pavan Deolasee wrote:
>> Anyways, we seem good to go with the patch.
>
> Ok, committed. Thanks for the analysis!

Oh, forgot that this needs to be backported, of course. Will do that
shortly...

- Heikki
On Wed, Oct 19, 2016 at 8:29 PM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> On 10/19/2016 01:07 PM, Pavan Deolasee wrote:
>>
>> Anyways, we seem good to go with the patch.
>
> Ok, committed. Thanks for the analysis!

Thanks! I am surprised that you kept the TAP test at the end.
--
Michael
On 10/19/2016 02:32 PM, Heikki Linnakangas wrote:
> On 10/19/2016 02:29 PM, Heikki Linnakangas wrote:
>> On 10/19/2016 01:07 PM, Pavan Deolasee wrote:
>>> Anyways, we seem good to go with the patch.
>>
>> Ok, committed. Thanks for the analysis!
>
> Oh, forgot that this needs to be backported, of course. Will do that
> shortly...

Done.

This didn't include anything to cope with an already-corrupt FSM, BTW.
Do we still want to try something for that? I think it's good enough if
we prevent the FSM corruption from happening, but not sure what the
consensus on that might be..

- Heikki
Heikki Linnakangas <hlinnaka@iki.fi> writes:
> This didn't include anything to cope with an already-corrupt FSM, BTW.
> Do we still want to try something for that? I think it's good enough if
> we prevent the FSM corruption from happening, but not sure what the
> consensus on that might be..

Can we document an existing procedure for repairing FSM corruption?
(VACUUM, maybe?) We're going to need to be able to explain how to fix
busted visibility maps in 9.6.1, too.

			regards, tom lane
On Wed, Oct 19, 2016 at 6:44 PM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
On 10/19/2016 02:32 PM, Heikki Linnakangas wrote:
Oh, forgot that this needs to be backported, of course. Will do that
shortly...
Done.
Thanks!
This didn't include anything to cope with an already-corrupt FSM, BTW. Do we still want to try something for that? I think it's good enough if we prevent the FSM corruption from happening, but not sure what the consensus on that might be..
Thanks,
Pavan
Pavan Deolasee http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Wed, Oct 19, 2016 at 6:54 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Can we document an existing procedure for repairing FSM corruption?
(VACUUM, maybe?)
I'm afraid it may not be easy to repair the corruption with existing facilities. Most often the corruption will be on the standby, and a VACUUM may not actually touch the affected pages on the master (because they may not even exist on the master, or may be skipped because of the visibility map). It may not even trigger relation truncation. So AFAICS it may not generate any WAL activity that can fix the corruption on the standby.
One possible way would be to delete the FSM (and VM) information on the master and the standby and then run VACUUM so these maps are rebuilt. We obviously don't need to do this for all tables, but we need a way to find the tables with a corrupt FSM [1].
Suggested procedure could be:
1. Upgrade master and standby to the latest minor release (which involves restart)
2. Install the pg_freespacemap extension and run the query [1] on the master to find possible corruption. The query checks whether the FSM reports free space in a block beyond the current size of the relation. Unfortunately, we might get false positives if the relation is extended while the query is running.
3. Repeat the same query on standby (if it's running in Hot standby mode, otherwise the corruption can only be detected once it's promoted to be a master)
4. Remove the FSM and VM files for the affected tables (I don't think it's safe to do this on a running server)
5. VACUUM the affected tables so that the FSM and VM are rebuilt.
Another idea is to implement a pg_freespace_repair() function in pg_freespacemap which takes an AccessExclusiveLock on the table and truncates the FSM to the table's current size, thus generating a WAL record that the standby will replay to fix the corruption. This probably looks more promising, easier to explain, and less error prone.
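To make that alternative concrete, a hypothetical sketch of such a function follows (the name pg_freespace_repair and its shape are illustrative only; note that FreeSpaceMapTruncateRel() itself does not write WAL, so propagating the repair to standbys, which is the interesting part of the idea, would additionally require logging the truncation and is omitted here):

    #include "postgres.h"

    #include "access/heapam.h"
    #include "fmgr.h"
    #include "storage/bufmgr.h"
    #include "storage/freespace.h"
    #include "storage/lmgr.h"
    #include "utils/rel.h"

    PG_MODULE_MAGIC;

    PG_FUNCTION_INFO_V1(pg_freespace_repair);

    /*
     * Hypothetical sketch: truncate the FSM so that it cannot describe blocks
     * beyond the relation's current size.  WAL-logging of the truncation
     * (needed to repair standbys) is deliberately left out of this sketch.
     */
    Datum
    pg_freespace_repair(PG_FUNCTION_ARGS)
    {
        Oid         relid = PG_GETARG_OID(0);
        Relation    rel;

        /* Keep out concurrent inserters while the map is being repaired. */
        rel = heap_open(relid, AccessExclusiveLock);

        FreeSpaceMapTruncateRel(rel, RelationGetNumberOfBlocks(rel));

        heap_close(rel, AccessExclusiveLock);
        PG_RETURN_VOID();
    }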
[1]
SELECT *
FROM (
    SELECT oid::regclass as relname,
           EXISTS (
               SELECT *
               FROM (
                   SELECT blkno, pg_freespace(oid::regclass, blkno)
                   FROM generate_series(pg_relation_size(oid::regclass) / current_setting('block_size')::bigint,
                                        pg_relation_size(oid::regclass, 'fsm') / 2) as blkno
               ) as avail
               WHERE pg_freespace > 0
           ) as corrupt_fsm
    FROM pg_class
    WHERE relkind = 'r'
) b
WHERE b.corrupt_fsm = true;
Thanks,
Pavan
Pavan Deolasee http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Thu, Oct 20, 2016 at 2:11 PM, Pavan Deolasee <pavan.deolasee@gmail.com> wrote:
> 4. Remove FSM and VM files for the affected tables (I don't think if it's
> safe to do this on a running server)

Definitely not while the server is running... For VMs a good way would
be to use pg_visibility's pg_truncate_visibility_map(), but only for
9.6~. For FSM there is no real solution, and actually a
pg_truncate_fsm would prove to be useful here. So users would need to
stop the server once if there are corrupted VMs or FSMs, delete them
manually, and then run VACUUM to rebuild them.
--
Michael
On Thu, Oct 20, 2016 at 10:50 AM, Michael Paquier <michael.paquier@gmail.com> wrote:
For VMs a good way would
be to use pg_visibility's pg_truncate_visibility_map(), but only for
9.6~.
Ah ok..
For FSM there is no real solution, and actually a
pg_truncate_fsm would prove to be useful here.
Right, that's what I proposed as an alternative idea. I agree this is a much cleaner and less error-prone approach.
Actually, if we could add an API which can truncate the FSM to a given heap block, then the user may not even need to run VACUUM, which could be costly for very large tables. Also, AFAICS we will need to backport pg_truncate_visibility_map() to older releases, because unless the VM is truncated along with the FSM, VACUUM may not scan all pages and the FSM entries for those pages won't be recorded.
Thanks,
Pavan
Pavan Deolasee http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Thu, Oct 20, 2016 at 2:50 PM, Pavan Deolasee <pavan.deolasee@gmail.com> wrote:
> Actually, if we could add an API which can truncate FSM to the given heap
> block, then the user may not even need to run VACUUM, which could be costly
> for very large tables.

FreeSpaceMapTruncateRel()?

> Also, AFAICS we will need to backport
> pg_truncate_visibility_map() to older releases because unless the VM is
> truncated along with the FSM, VACUUM may not scan all pages and the FSM for
> those pages won't be recorded.

This would need a careful lookup, and it would not be that complicated
to implement. And this bug justifies the presence of a tool like that
actually... But usually those don't get a backpatch, so the probability
of getting that backported is low IMO, personally I am not sure I like
that either.
--
Michael
On Thu, Oct 20, 2016 at 11:34 AM, Michael Paquier <michael.paquier@gmail.com> wrote:
On Thu, Oct 20, 2016 at 2:50 PM, Pavan Deolasee
<pavan.deolasee@gmail.com> wrote:
> Actually, if we could add an API which can truncate FSM to the given heap
> block, then the user may not even need to run VACUUM, which could be costly
> for very large tables.
FreeSpaceMapTruncateRel()?
Right. We need a SQL callable function to invoke that.
> Also, AFAICS we will need to backport
> pg_truncate_visibility_map() to older releases because unless the VM is
> truncated along with the FSM, VACUUM may not scan all pages and the FSM for
> those pages won't be recorded.
This would need a careful lookup, and it would not be that complicated
to implement. And this bug justifies the presence of a tool like that
actually... But usually those don't get a backpatch, so the
probability of getting that backported is low IMO, personally I am not
sure I like that either.
Just to clarify, I meant that if we truncate the entire FSM then we'll need an API to truncate the VM as well, so that VACUUM rebuilds everything completely. OTOH, if we provide a function just to truncate the FSM to match the size of the table, then we don't need to rebuild the FSM. So that's probably a better way to handle FSM corruption, as far as this particular issue is concerned.
Thanks,
Pavan
Pavan Deolasee http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Thu, Oct 20, 2016 at 3:37 PM, Pavan Deolasee <pavan.deolasee@gmail.com> wrote:
> Just to clarify, I meant if we truncate the entire FSM then we'll need API
> to truncate VM as well so that VACUUM rebuilds everything completely. OTOH
> if we provide a function just to truncate FSM to match the size of the
> table, then we don't need to rebuild the FSM. So that's probably a better
> way to handle FSM corruption, as far as this particular issue is concerned.

To be honest, I think that just having in the release notes a method that
does not involve the use of any extra extension or SQL routine is fine. So
we could just tell users to:
1) Run something like the query you gave upthread, giving the user a list
of the files that are corrupted, and add this query to the release notes.
2) If anything is found, stop the server and delete the files manually.
3) Re-start the server.
OK, that's troublesome and costly for large relations, but we know
that's the safest way to go for any versions, and there is no need to
complicate the code with any one-time repairing extensions.

Speaking of which, I implemented a small extension able to truncate
the FSM up to the size of the relation as attached, but as I looked at
it SMGR_TRUNCATE_FSM has been introduced in 9.6 so its range of action
is rather limited... And I pushed as well a version on github:
https://github.com/michaelpq/pg_plugins/tree/master/pg_fix_truncation
The limited range of such an extension is an argument good enough to
just rely on the stop/delete-FSM/start method to fix an instance and
let VACUUM do the rest of the work. That looks to work but use it at
your own risk.

This bug would be a good blog topic by the way...
--
Michael
Attachment
On 10/20/16 10:15 PM, Michael Paquier wrote:
> 2) If anything is found, stop the server and delete the files manually.
> 3) Re-start the server.
> OK, that's troublesome and costly for large relations, but we know
> that's the safest way to go for any versions, and there is no need to
> complicate the code with any one-time repairing extensions.

Wouldn't you need to run around to all your replicas and do that as well?

> Speaking of which, I implemented a small extension able to truncate
> the FSM up to the size of the relation as attached, but as I looked at
> it SMGR_TRUNCATE_FSM has been introduced in 9.6 so its range of action
> is rather limited... And I pushed as well a version on github:
> https://github.com/michaelpq/pg_plugins/tree/master/pg_fix_truncation
> The limited range of such an extension is an argument good enough to
> just rely on the stop/delete-FSM/start method to fix an instance and
> let VACUUM do the rest of the work. That looks to work but use it at
> your own risk.

Shouldn't the truncation be logged before it's performed?
--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com
855-TREBLE2 (855-873-2532)   mobile: 512-569-9461
On Sat, Oct 22, 2016 at 5:17 AM, Jim Nasby <Jim.Nasby@bluetreble.com> wrote:
> On 10/20/16 10:15 PM, Michael Paquier wrote:
>>
>> 2) If anything is found, stop the server and delete the files manually.
>> 3) Re-start the server.
>> OK, that's troublesome and costly for large relations, but we know
>> that's the safest way to go for any versions, and there is no need to
>> complicate the code with any one-time repairing extensions.
>
> Wouldn't you need to run around to all your replicas and do that as well?

Yeah...

>> Speaking of which, I implemented a small extension able to truncate
>> the FSM up to the size of the relation as attached, but as I looked at
>> it SMGR_TRUNCATE_FSM has been introduced in 9.6 so its range of action
>> is rather limited... And I pushed as well a version on github:
>> https://github.com/michaelpq/pg_plugins/tree/master/pg_fix_truncation
>> The limited range of such an extension is an argument good enough to
>> just rely on the stop/delete-FSM/start method to fix an instance and
>> let VACUUM do the rest of the work. That looks to work but use it at
>> your own risk.
>
> Shouldn't the truncation be logged before it's performed?

Doh. Yes, thanks for the reminder. And I commented on that upthread..
--
Michael
On Sat, Oct 22, 2016 at 7:31 AM, Michael Paquier <michael.paquier@gmail.com> wrote:
> On Sat, Oct 22, 2016 at 5:17 AM, Jim Nasby <Jim.Nasby@bluetreble.com> wrote:
>> On 10/20/16 10:15 PM, Michael Paquier wrote:
>>>
>>> 2) If anything is found, stop the server and delete the files manually.
>>> 3) Re-start the server.
>>> OK, that's troublesome and costly for large relations, but we know
>>> that's the safest way to go for any versions, and there is no need to
>>> complicate the code with any one-time repairing extensions.
>>
>> Wouldn't you need to run around to all your replicas and do that as well?
>
> Yeah...

Release notes refer to this page for methods to fix corrupted instances:
https://wiki.postgresql.org/wiki/Free_Space_Map_Problems
I just typed something based on Pavan's upper method, feel free to
jump in and improve it as necessary.
--
Michael
I wrote:
> It looks to me like this is approximating the highest block number that
> could possibly have an FSM entry as size of the FSM fork (in bytes)
> divided by 2. But the FSM stores one byte per block. There is overhead
> for the FSM search tree, but in a large relation it's not going to be as
> much as a factor of 2. So I think that to be conservative we need to
> drop the "/ 2". Am I missing something?

Ah, scratch that, after rereading the FSM README I see it's correct,
because there's a binary tree within each page; I'd only remembered
that there was a search tree of pages.

Also, we could at least discount the FSM root page and first intermediate
page, no? That is, the upper limit could be

    pg_relation_size(oid::regclass, 'fsm') / 2 - 2*current_setting('block_size')::BIGINT

I think this is a worthwhile improvement because it reduces the time spent
on small relations. For me, the query as given takes 9 seconds to examine
the regression database, which seems like a lot. Discounting two pages
reduces that to 20 ms.

			regards, tom lane
On Mon, Oct 24, 2016 at 9:34 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
SELECT blkno, pg_freespace(oid::regclass, blkno)
FROM generate_series(pg_relation_size(oid::regclass) / current_setting('block_size')::BIGINT,
                     pg_relation_size(oid::regclass, 'fsm') / 2) AS blkno
It looks to me like this is approximating the highest block number that
could possibly have an FSM entry as size of the FSM fork (in bytes)
divided by 2. But the FSM stores one byte per block. There is overhead
for the FSM search tree, but in a large relation it's not going to be as
much as a factor of 2. So I think that to be conservative we need to
drop the "/ 2". Am I missing something?
I went by these comments in fsm_internals.h, which suggest that SlotsPerFSMPage is limited to somewhat less than BLCKSZ divided by 2:
/*
* Number of non-leaf and leaf nodes, and nodes in total, on an FSM page.
* These definitions are internal to fsmpage.c.
*/
#define NodesPerPage (BLCKSZ - MAXALIGN(SizeOfPageHeaderData) - \
                      offsetof(FSMPageData, fp_nodes))
#define NonLeafNodesPerPage (BLCKSZ / 2 - 1)
#define LeafNodesPerPage (NodesPerPage - NonLeafNodesPerPage)
/*
* Number of FSM "slots" on a FSM page. This is what should be used
* outside fsmpage.c.
*/
#define SlotsPerFSMPage LeafNodesPerPage
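For concreteness, plugging the default 8 kB block size into those definitions gives roughly the following (back-of-the-envelope arithmetic, assuming the usual 24-byte page header and a 4-byte fp_next_slot):

    /*
     * Rough arithmetic for BLCKSZ = 8192:
     *
     *   NodesPerPage        = 8192 - 24 - 4 = 8164
     *   NonLeafNodesPerPage = 8192 / 2 - 1  = 4095
     *   LeafNodesPerPage    = 8164 - 4095   = 4069
     *
     * So SlotsPerFSMPage comes out a bit under BLCKSZ / 2, which is why
     * dividing the FSM fork size (in bytes) by 2 is a safe, slightly
     * conservative upper bound on the heap blocks it can describe.
     */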
Thanks,
Pavan
Pavan Deolasee http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Michael Paquier <michael.paquier@gmail.com> writes:
> Release notes refer to this page for methods to fix corrupted instances:
> https://wiki.postgresql.org/wiki/Free_Space_Map_Problems
> I just typed something based on Pavan's upper method, feel free to
> jump in and improve it as necessary.

Thanks for drafting that, but isn't the query wrong? Specifically the bit about

    SELECT blkno, pg_freespace(oid::regclass, blkno)
    FROM generate_series(pg_relation_size(oid::regclass) / current_setting('block_size')::BIGINT,
                         pg_relation_size(oid::regclass, 'fsm') / 2) AS blkno

It looks to me like this is approximating the highest block number that
could possibly have an FSM entry as size of the FSM fork (in bytes)
divided by 2. But the FSM stores one byte per block. There is overhead
for the FSM search tree, but in a large relation it's not going to be as
much as a factor of 2. So I think that to be conservative we need to
drop the "/ 2". Am I missing something?

			regards, tom lane
On Mon, Oct 24, 2016 at 9:47 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Also, we could at least discount the FSM root page and first intermediate
page, no? That is, the upper limit could be
pg_relation_size(oid::regclass, 'fsm') / 2 - 2*current_setting('block_size')::BIGINT
+1 for doing that.
Thanks,
Pavan
Pavan Deolasee http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Tom Lane wrote:
> Ah, scratch that, after rereading the FSM README I see it's correct,
> because there's a binary tree within each page; I'd only remembered
> that there was a search tree of pages.
>
> Also, we could at least discount the FSM root page and first intermediate
> page, no? That is, the upper limit could be
>
>     pg_relation_size(oid::regclass, 'fsm') / 2 - 2*current_setting('block_size')::BIGINT
>
> I think this is a worthwhile improvement because it reduces the time spent
> on small relations. For me, the query as given takes 9 seconds to examine
> the regression database, which seems like a lot. Discounting two pages
> reduces that to 20 ms.

Hah, good one. We spent some time thinking about subtracting some value
to make the value more accurate but it didn't occur to me to just use
constant two.
--
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Alvaro Herrera <alvherre@2ndquadrant.com> writes:
> Tom Lane wrote:
>> Also, we could at least discount the FSM root page and first intermediate
>> page, no? That is, the upper limit could be
>>
>>     pg_relation_size(oid::regclass, 'fsm') / 2 - 2*current_setting('block_size')::BIGINT
>>
>> I think this is a worthwhile improvement because it reduces the time spent
>> on small relations. For me, the query as given takes 9 seconds to examine
>> the regression database, which seems like a lot. Discounting two pages
>> reduces that to 20 ms.

> Hah, good one. We spent some time thinking about subtracting some value
> to make the value more accurate but it didn't occur to me to just use
> constant two.

I got the arithmetic wrong in the above, it should be like

    (pg_relation_size(oid::regclass, 'fsm') - 2*current_setting('block_size')::BIGINT) / 2

With that, the runtime on HEAD's regression DB is about 700 ms, which is
still a nice win over 9000 ms.

I've put up draft wiki pages about these two problems at
https://wiki.postgresql.org/wiki/Free_Space_Map_Problems
https://wiki.postgresql.org/wiki/Visibility_Map_Problems
(thanks to Michael Paquier for initial work on the first one).
They're meant to be reasonably generic about FSM/VM problems rather
than only being about our current bugs. Please review.

			regards, tom lane