Re: Incorrect snapshots while promoting hot standby node when 2PC is used - Mailing list pgsql-hackers
From: Andrey Borodin
Subject: Re: Incorrect snapshots while promoting hot standby node when 2PC is used
Msg-id: 9AE6D0CE-D423-41A4-831A-2571EFA0FC07@yandex-team.ru
In response to: Incorrect snapshots while promoting hot standby node when 2PC is used (Andres Freund <andres@anarazel.de>)
Responses: Re: Incorrect snapshots while promoting hot standby node when 2PC is used
List: pgsql-hackers
Hi Andres!

> On 23 Apr 2021, at 01:36, Andres Freund <andres@anarazel.de> wrote:
>
> So snapshots within that window will always be "empty", i.e. xmin is
> latestCompletedXid and xmax is latestCompletedXid + 1. Once we reach 3), we'll
> look at the procarray, which then leads xmin going back to 588.
>
> I think that this can lead to data corruption, because a too new xmin horizon
> could lead to rows from a prepared transaction getting hint bitted as dead (or
> perhaps even pruned, although that's probably harder to hit). Due to the too
> new xmin horizon we won't treat rows by the prepared xact as in-progress, and
> TransactionIdDidCommit() will return false, as the transaction didn't commit
> yet. Which afaict can result in row versions created by the prepared
> transaction being invisible even after the prepared transaction commits.
>
> Without prepared transaction there probably isn't an issue, because there
> shouldn't be any other in-progress xids at that point?

I'm investigating a somewhat similar case. We have an OLTP sharded installation where shards are almost always under rebalancing. Data movement is implemented with 2PC. Switchover happens quite often due to datacenter drills. The installation is running PostgreSQL 12.6.

In January, heapcheck of a backup reported some suspicious problems:

ERROR: Page marked as all-frozen, but found non frozen tuple. Oid(relation)=18487, blkno(page)=1470240, offnum(tuple)=1
ERROR: Page marked as all-frozen, but found non frozen tuple. Oid(relation)=18487, blkno(page)=1470241, offnum(tuple)=1
ERROR: Page marked as all-frozen, but found non frozen tuple. Oid(relation)=18487, blkno(page)=1470242, offnum(tuple)=1

...and so on for ~100 pages: tuples with lp==1 were not frozen. We froze the tuples with pg_dirty_hands and ran VACUUM (DISABLE_PAGE_SKIPPING) on the table.

At the end of March the same shard struck again with:

ERROR: Page marked as all-frozen, but found non frozen tuple. Oid(relation)=18487, blkno(page)=1470240, offnum(tuple)=42

...around ~1040 blocks (starting from the same 1470240!) had a non-frozen tuple at lp==42. I ran:

update s3.objects_65 set created = created where ctid = '(1470241,42)' returning *;

After that, heapcheck showed a visibility map problem:

ERROR: XX001: Found non all-visible tuple. Oid(relation)=18487, blkno(page)=1470240, offnum(tuple)=42
LOCATION: collect_corrupt_items, heap_check.c:186

VACUUM fixed it with warnings:

WARNING: 01000: page is not marked all-visible but visibility map bit is set in relation "objects_65" page 1470240

and failed on the next page:

ERROR: XX001: found xmin 1650436694 from before relfrozenxid 1752174172
LOCATION: heap_prepare_freeze_tuple, heapam.c:6172

I ran the update for all tuples in heapcheck's ctid list, and a subsequent VACUUM (without page skipping). This satisfied our corruption monitoring.

Can this case be related to the problem that you described? Or, perhaps, does it look more like a hardware problem? Data checksums are on, but a few years ago we observed SSD firmware that was losing updates while still passing checksums. I'm sure that we would benefit from having a separate relation fork for checksums or LSNs.

We observe similar cases 3-5 times a year. To date no data has been lost due to this, but it's somewhat annoying. BTW I'd say that such things are an argument for back-porting pg_surgery and heapcheck to old versions.

Thanks!

Best regards, Andrey Borodin.
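For clarity, the per-tuple repair procedure described above can be sketched as follows (a sketch only, using the s3.objects_65 table and one ctid from the heapcheck output above; this kind of surgery should obviously only be attempted after taking a backup):

```sql
-- For each ctid reported by heapcheck, perform a no-op UPDATE so a fresh
-- tuple version is written; the old, inconsistently-frozen version can then
-- be cleaned up by vacuum. '(1470241,42)' is one ctid from the report above.
UPDATE s3.objects_65 SET created = created
WHERE ctid = '(1470241,42)' RETURNING *;

-- Re-scan every page, ignoring the (possibly corrupt) visibility map,
-- so that VM bits and tuple freeze state are rebuilt consistently.
VACUUM (DISABLE_PAGE_SKIPPING) s3.objects_65;
```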