Re: Recovery inconsistencies, standby much larger than primary - Mailing list pgsql-hackers

From Greg Stark
Subject Re: Recovery inconsistencies, standby much larger than primary
Msg-id CAM-w4HO_iqo97VPCs1ZigQk7MGePFjWUZf=wpsTPy-4izhiMJA@mail.gmail.com
In response to Re: Recovery inconsistencies, standby much larger than primary  (Greg Stark <stark@mit.edu>)
Responses Re: Recovery inconsistencies, standby much larger than primary  (Greg Stark <stark@mit.edu>)
Re: Recovery inconsistencies, standby much larger than primary  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-hackers
So I think I've come up with a scenario that could cause this. I don't
think it's exactly what happened here but maybe something analogous
happened with our base backup restore.

On the primary you extend a table a bunch, including adding new
segments, but crash before committing (or checkpointing). Then some
but not all of the blocks may have been written to disk. Assume they're
all written except for the last block of the first file. So what you
have is a .999G file followed by, say, 9 1G files. (Or maybe the hot
backup process could just catch the files in this state if a table is
rapidly growing and it doesn't take care to avoid picking up new files
that appear after it starts?)

smgrnblocks() stops at the first < 1GB segment and ignores the rest.
The extension code in xlogutils uses it to calculate how many blocks to
add, but it only calls it once and doesn't recheck where it's at as it
extends the relation. As soon as it adds that one missing block the
remaining files become visible. P_NEW recalculates the position
based on smgrnblocks each time (which sounds pretty inefficient, but
anyway) so it will add the requested blocks to the new end.
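
To make the mechanism concrete, here is a toy model of it. This is only
a sketch: the function names, the 4-block segment size, and the list
representation of segment files are all made up for illustration and
are not the actual smgr or xlogutils code.

```python
SEG_BLOCKS = 4  # toy segment size in blocks (hypothetical; real segments are far larger)


def visible_nblocks(segments):
    """Count blocks as described above: stop at the first short segment."""
    total = 0
    for blocks in segments:
        total += blocks
        if blocks < SEG_BLOCKS:
            break  # every later segment file is ignored
    return total


def extend_at_end(segments):
    """P_NEW-like extension: recompute the visible end, add one block there."""
    blkno = visible_nblocks(segments)
    segno = blkno // SEG_BLOCKS
    if segno == len(segments):
        segments.append(0)  # start a new segment file
    segments[segno] += 1
    return blkno


def replay_extend(segments, target_blkno):
    """Extension loop that sizes its work from one up-front block count."""
    lastblock = visible_nblocks(segments)  # computed once, never rechecked
    got = None
    while target_blkno >= lastblock:
        got = extend_at_end(segments)
        lastblock += 1
    return got


# First segment is one block short; nine full segments follow it.
segments = [SEG_BLOCKS - 1] + [SEG_BLOCKS] * 9
first_visible = visible_nblocks(segments)
landed_at = replay_extend(segments, target_blkno=10 * SEG_BLOCKS - 1)
```

In this model only the 3 blocks of the first segment are visible at
first. The first extension fills the gap, the other nine segments
become visible, and every later extension lands past the old end of
the relation, so the final block ends up well beyond the target.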

Now this isn't enough to explain things, since surely the extension
records would be in the xlog in physical order. But this could have
all happened after an earlier vacuum truncated the relation, and we
could be replaying records that predate that.

So in short, if you have a 10G table and want to overwrite the last
block but the first segment is one block short then xlog will add 9G
to the end and write the block there. That sounds like what we've
seen.
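
The 9G figure can be checked with a little arithmetic. The sketch below
assumes 8kB blocks and 1GB segments (the usual defaults, but an
assumption here) and a loop that runs once per block between the
initially visible end and the target; the variable names are
illustrative only.

```python
BLCKSZ = 8192                       # assumed block size in bytes
SEG_BYTES = 1024 ** 3               # assumed segment size (1GB)
RELSEG_SIZE = SEG_BYTES // BLCKSZ   # blocks per full segment

total_segments = 10
target_blkno = total_segments * RELSEG_SIZE - 1   # last block of the 10G table
lastblock = RELSEG_SIZE - 1         # visible count: first segment is one block short

iterations = target_blkno - lastblock + 1   # extensions the loop performs
appended = iterations - 1           # all but the first land past the old end
appended_gib = appended * BLCKSZ / SEG_BYTES
final_blkno = total_segments * RELSEG_SIZE + appended - 1
```

One extension fills the missing block and the rest, exactly nine
segments' worth, are appended after the existing ten.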

I think the easy fix is to change the code in xlogutils to be more
defensive and stop as soon as it finds BufferGetBlockNumber(buffer) ==
blkno (which is what it has in the assert already).
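
A sketch of that defensive idea in a toy model follows. The names and
the 4-block segment size are hypothetical, and one detail is an
assumption of the sketch rather than something stated above: it stops
on >= instead of ==, so that it also terminates when the newly visible
segments make an extension jump past the target block.

```python
SEG_BLOCKS = 4  # toy segment size in blocks (hypothetical)


def visible_nblocks(segments):
    """Count blocks, stopping at the first short segment."""
    total = 0
    for blocks in segments:
        total += blocks
        if blocks < SEG_BLOCKS:
            break
    return total


def extend_at_end(segments):
    """P_NEW-like extension: add one block at the currently visible end."""
    blkno = visible_nblocks(segments)
    segno = blkno // SEG_BLOCKS
    if segno == len(segments):
        segments.append(0)
    segments[segno] += 1
    return blkno


def replay_extend_defensive(segments, target_blkno):
    """Extend until the block number actually returned reaches the target."""
    if target_blkno < visible_nblocks(segments):
        return target_blkno  # block already exists, nothing to extend
    while True:
        got = extend_at_end(segments)
        if got >= target_blkno:  # assumption: >= rather than == (see above)
            return got


segments = [SEG_BLOCKS - 1] + [SEG_BLOCKS] * 9
target = 10 * SEG_BLOCKS - 1
got = replay_extend_defensive(segments, target)
overshot = got > target  # target already existed; it would be read directly
```

Here the loop stops after two extensions instead of adding nine
segments: one fills the gap, and the second overshoots by a single
block, at which point the target is known to exist already.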
-- 
greg


