Re: wal_consistency_checking reports an inconsistency on masterbranch - Mailing list pgsql-hackers

From Heikki Linnakangas
Subject Re: wal_consistency_checking reports an inconsistency on masterbranch
Msg-id 1a37a9bb-4a0f-80eb-773f-0ebd3e1a7b79@iki.fi
In response to Re: wal_consistency_checking reports an inconsistency on masterbranch  (Michael Paquier <michael@paquier.xyz>)
Responses Re: wal_consistency_checking reports an inconsistency on masterbranch
List pgsql-hackers
On 13/04/18 13:08, Michael Paquier wrote:
> On Fri, Apr 13, 2018 at 02:15:35PM +0530, amul sul wrote:
>> I have looked into this and found that the issue is in heap_xlog_delete -- we
>> have missed to set the correct offset number from the target_tid when
>> XLH_DELETE_IS_PARTITION_MOVE flag is set.
> 
> Oh, this looks good to me.  So when a row was moved across partitions
> this could have caused incorrect tuple references on a standby, which
> could have caused corruptions.

Hmm. So, the problem was that HeapTupleHeaderSetMovedPartitions() only 
sets the block number to InvalidBlockNumber, and leaves the offset 
number unchanged. WAL replay didn't preserve the offset number, so the 
master and the standby had a different offset number in the ctid.
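To make the failure mode concrete, here is a minimal standalone model. It is not PostgreSQL's actual code: the struct and the names model_set_moved_block_only(), model_set_moved_full() and model_tid_equal() are hypothetical, and the real ItemPointerData splits the block number into two 16-bit halves. The point it illustrates is that a setter which overwrites only the block number keeps whatever offset was there before, so two copies of the same tuple can disagree. Writing the whole magic TID makes the result independent of prior state.

```c
#include <stdbool.h>
#include <stdint.h>

/* Simplified stand-ins for PostgreSQL's BlockNumber/OffsetNumber types. */
typedef uint32_t BlockNumber;
typedef uint16_t OffsetNumber;

#define InvalidBlockNumber ((BlockNumber) 0xFFFFFFFF)
/* Magic offset proposed in this message for "moved to another partition". */
#define MovedPartitionsOffsetNumber ((OffsetNumber) 0xfffd)

/* Hypothetical model of a tuple's t_ctid. */
typedef struct
{
    BlockNumber  block;
    OffsetNumber offset;
} ModelItemPointer;

/* Old behaviour (modelled): only the block number is overwritten,
 * so the previous offset number survives. */
static void
model_set_moved_block_only(ModelItemPointer *ctid)
{
    ctid->block = InvalidBlockNumber;
}

/* Proposed behaviour (modelled): the whole TID becomes one magic value. */
static void
model_set_moved_full(ModelItemPointer *ctid)
{
    ctid->block = InvalidBlockNumber;
    ctid->offset = MovedPartitionsOffsetNumber;
}

/* Byte-for-byte comparison of two modelled TIDs. */
static bool
model_tid_equal(ModelItemPointer a, ModelItemPointer b)
{
    return a.block == b.block && a.offset == b.offset;
}
```

For example, starting from two copies that differ only in a leftover offset, say {3, 7} and {3, 1} (arbitrary illustrative values), the block-only setter leaves them different, while the full setter makes them identical.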

Why does HeapTupleHeaderSetMovedPartitions() leave the offset number 
unchanged? The old offset number is meaningless without the block 
number. Also, bits and magic values in the tuple header are scarce. 
We're squandering a whole range of values in the ctid, everything with 
ip_blkid==InvalidBlockNumber, to mean "moved to different partition", 
when a single value would suffice.
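The "squandered range" argument is simple arithmetic. A test that looks only at the block number claims every one of the 2^16 offset values paired with InvalidBlockNumber, while an exact-TID test claims a single value. A minimal sketch counting both, using hypothetical predicate names rather than PostgreSQL's macros:

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t BlockNumber;
typedef uint16_t OffsetNumber;

#define InvalidBlockNumber ((BlockNumber) 0xFFFFFFFF)
#define MovedPartitionsOffsetNumber ((OffsetNumber) 0xfffd)

/* Block-only test: any TID with an invalid block number counts as "moved". */
static bool
model_moved_by_block(BlockNumber blk, OffsetNumber off)
{
    (void) off;                 /* the offset is ignored entirely */
    return blk == InvalidBlockNumber;
}

/* Exact test: only the single magic TID counts as "moved". */
static bool
model_moved_exact(BlockNumber blk, OffsetNumber off)
{
    return blk == InvalidBlockNumber && off == MovedPartitionsOffsetNumber;
}

/* Count how many (InvalidBlockNumber, offset) pairs a predicate claims. */
static uint32_t
model_count_claimed(bool (*pred)(BlockNumber, OffsetNumber))
{
    uint32_t count = 0;

    /* uint32_t loop variable so that 0..65535 terminates without wrapping */
    for (uint32_t off = 0; off <= UINT16_MAX; off++)
        if (pred(InvalidBlockNumber, (OffsetNumber) off))
            count++;
    return count;
}
```

Here model_count_claimed(model_moved_by_block) yields 65536 and model_count_claimed(model_moved_exact) yields 1, leaving the rest of that range free for other uses.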

Let's tighten that up. In the attached (untested) patch, I changed the 
macros so that "moved to different partition" is indicated by the magic 
TID (InvalidBlockNumber, 0xfffd). Offset number 0xfffe was already used 
for speculative insertion tokens, so this follows that precedent.
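A sketch of how the two magic offsets partition the ctid space under the proposed scheme. The enum and model_classify_ctid() are hypothetical illustrations, not PostgreSQL functions. The offsets 0xfffe and 0xfffd come from the message; the detail that a speculative insertion token is stored in the block-number field reflects PostgreSQL's existing design.

```c
#include <stdint.h>

typedef uint32_t BlockNumber;
typedef uint16_t OffsetNumber;

#define InvalidBlockNumber          ((BlockNumber) 0xFFFFFFFF)
#define SpecTokenOffsetNumber       ((OffsetNumber) 0xfffe)  /* existing precedent */
#define MovedPartitionsOffsetNumber ((OffsetNumber) 0xfffd)  /* proposed value */

typedef enum
{
    MODEL_CTID_NORMAL,
    MODEL_CTID_SPEC_TOKEN,
    MODEL_CTID_MOVED_PARTITIONS
} ModelCtidKind;

/* Hypothetical classifier for a tuple's ctid under the proposed scheme. */
static ModelCtidKind
model_classify_ctid(BlockNumber blk, OffsetNumber off)
{
    /* Speculative insertion: magic offset, block field carries the token. */
    if (off == SpecTokenOffsetNumber)
        return MODEL_CTID_SPEC_TOKEN;

    /* Moved to another partition: exactly one magic TID. */
    if (blk == InvalidBlockNumber && off == MovedPartitionsOffsetNumber)
        return MODEL_CTID_MOVED_PARTITIONS;

    /* Everything else is treated as an ordinary TID. */
    return MODEL_CTID_NORMAL;
}
```

Because the two markers differ in their offset numbers, they can never be confused with each other, and a leftover (InvalidBlockNumber, 7) is no longer mistaken for a moved-partition marker.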

I kept using InvalidBlockNumber there, so ItemPointerIsValid() still 
considers those item pointers as invalid. But my gut feeling is actually 
that it would be better to use e.g. 0 as the block number, so that these 
item pointers would appear valid. Again, to follow the precedent of 
speculative insertion tokens. But I'm not sure if there was some 
well-thought-out reason to make them appear invalid. A comment on that 
would be nice, at least.

(Amit hinted at this in 
https://www.postgresql.org/message-id/CAA4eK1KtsTqsGDggDCrz2O9Jgo7ma-Co-B8%2Bv3L2zWMA2NHm6A%40mail.gmail.com. 
He was OK with the current approach, but I feel pretty strongly that we 
should also set the offset number.)

- Heikki

Attachment