Re: sequences vs. synchronous replication - Mailing list pgsql-hackers
From | Tomas Vondra |
---|---|
Subject | Re: sequences vs. synchronous replication |
Date | |
Msg-id | 0f827a71-a01b-bcf9-fe77-3047a9d4a93c@enterprisedb.com Whole thread Raw |
In response to | Re: sequences vs. synchronous replication (Tomas Vondra <tomas.vondra@enterprisedb.com>) |
List | pgsql-hackers |
On 12/24/21 11:40, Tomas Vondra wrote: > > ... > > FWIW I plan to explore the idea of looking at sequence page LSN, and > flushing up to that position. > So, I explored the page LSN idea, and it seems to be working pretty nicely. There still is some impact on the workload doing just nextval calls, but it's much better than WAL-logging everything. The patch is available at [1]. While working on that patch, I realized this actually affects even systems without any replicas - it's trivial to get a sequence going backwards. Imagine you do this: BEGIN; SELECT nextval('s') FROM generate_series(1,50) s(i); ROLLBACK; SELECT nextval('s'); -- kill -9 postgres It's pretty likely a nextval('s') after restarting the server returns a value from before the last nextval('s'), in case we do not manage to flush the WAL before the kill. The patch deals with this by updating XactLastRecEnd and then flushing up to that point in RecordTransactionCommit(). But for that to happen, we have to do the flush/wait even without a valid XID (which we may not have when nextval gets called outside a transaction). So I was wondering what other places do the same thing (generates WAL without setting a XID), because that might either have similar issues with not flushing data, and/or be affected by this change. RecordTransactionCommit() says about such cases this: /* * Check if we want to commit asynchronously. We can allow the * XLOG flush to happen asynchronously if synchronous_commit=off, * or if the current transaction has not performed any WAL-logged * operation or didn't assign an xid. The transaction can end up * not writing any WAL, even if it has an xid, if it only wrote to * temporary and/or unlogged tables. It can end up having written * WAL without an xid if it did HOT pruning. In case of a crash, * the loss of such a transaction will be irrelevant; temp tables * will be lost anyway, unlogged tables will be truncated and HOT * pruning will be done again later. (Given the foregoing, you * might think that it would be unnecessary to emit the XLOG record * at all in this case, but we don't currently try to do that. It * would certainly cause problems at least in Hot Standby mode, * where the KnownAssignedXids machinery requires tracking every * XID assignment. It might be OK to skip it only when wal_level < * replica, but for now we don't.) * * However, if we're doing cleanup of any non-temp rels or * committing any command that wanted to force sync commit, then we * must flush XLOG immediately. (We must not allow asynchronous * commit if there are any non-temp tables to be deleted, because * we might delete the files before the COMMIT record is flushed to * disk. We do allow asynchronous commit if all to-be-deleted * tables are temporary though, since they are lost anyway if we * crash.) */ Note: This relates only to XLogFlush vs. XLogSetAsyncXactLSN, not about waiting for sync standby. For that we ignore forceSyncCommit, which seems a bit weird ... Anyway, I was wondering what happens in practice, so I added very simple logging to RecordTransactionCommit(): if (wrote_log && !markXidCommitted) elog(WARNING, "not flushing at %X/%X", (uint32) (XactLastRecEnd >> 32), (uint32) XactLastRecEnd); and then ran installcheck, which produces ~700 messages. Looking at the WAL (last few records before the LSN reported by the log message), most of this is related to HOT pruning (i.e. PRUNE), but there's plenty of other WAL records. And I'm not sure if it's OK to just lose (some of) those messages, as the comment claims for PRUNE. It's quite possible I miss something basic/important, and everything is fine and dandy, but here's a couple non-pruning examples - command triggering the log message, along with the last few WAL records without XID assigned right before RecordTransactionCommit() was called. A more complete data set (full WAL dump, regression.diffs etc.) is available at [2]. ======================================================================== VACUUM ANALYZE num_exp_add; --------------------------- VISIBLE cutoff xid 37114 flags 0x01, blkref #0: rel 1663/63341/3697 ... INPLACE off 39, blkref #0: rel 1663/63341/1259 blk 5 SELECT proname, provolatile FROM pg_proc WHERE oid in ('functest_B_1'::regproc, 'functest_B_2'::regproc, 'functest_B_3'::regproc, 'functest_B_4'::regproc) ORDER BY proname; ------------------------------------------------ VACUUM nunused 223, blkref #0: rel 1663/63341/2608 blk 39 VISIBLE cutoff xid 39928 flags 0x01, blkref #0: rel 1663/63341/2608 ... VACUUM nunused 6, blkref #0: rel 1663/63341/2608 blk 40 META_CLEANUP last_cleanup_num_delpages 5, blkref #0: rel 1663/63341 ... META_CLEANUP last_cleanup_num_delpages 1, blkref #0: rel 1663/63341 ... INPLACE off 13, blkref #0: rel 1663/63341/1259 blk 4 INPLACE off 14, blkref #0: rel 1663/63341/1259 blk 4 INPLACE off 20, blkref #0: rel 1663/63341/1259 blk 8 INVALIDATIONS ; inval msgs: catcache 53 catcache 52 catcache 53 ... EXPLAIN (COSTS OFF) SELECT t1.a, t1.c, t2.a, t2.c FROM plt1_adv t1 INNER JOIN plt2_adv t2 ON (t1.a = t2.a AND t1.c = t2.c) WHERE t1.c IN ('0003', '0004', '0005') AND t1.b < 10 ORDER BY t1.a; ------------------------------------------------ INPLACE off 11, blkref #0: rel 1663/63341/69386 blk 67 INVALIDATIONS ; inval msgs: catcache 53 catcache 52 relcache 82777 VACUUM FREEZE indtoasttest; --------------------------- FREEZE_PAGE cutoff xid 47817 ntuples 4, blkref #0: rel 1663/63341 ... VISIBLE cutoff xid 47816 flags 0x03, blkref #0: rel 1663/63341/ ... INPLACE off 91, blkref #0: rel 1663/63341/69386 blk 37 SELECT brin_summarize_range('brin_summarize_multi_idx', 2); ----------------------------------------------------------- INSERT heapBlk 2 pagesPerRange 2 offnum 2, blkref #0: rel 1663/63 ... SAMEPAGE_UPDATE offnum 2, blkref #0: rel 1663/63341/73957 blk 2 SELECT brin_desummarize_range('brinidx_multi', 0); --------------------------------------------------- DESUMMARIZE pagesPerRange 1, heapBlk 0, page offset 9, blkref #0: ... select gin_clean_pending_list('gin_test_idx')>10 as many; ------------------------------------------------------------------------ DELETE_LISTPAGE ndeleted: 16, blkref #0: rel 1663/63341/71933 blk ... DELETE_LISTPAGE ndeleted: 16, blkref #0: rel 1663/63341/71933 blk ... DELETE_LISTPAGE ndeleted: 11, blkref #0: rel 1663/63341/71933 blk ... VACUUM no_index_cleanup; ------------------------------------------------------------------------ META_CLEANUP last_cleanup_num_delpages 0, blkref #0: rel 1663/63341 ... ======================================================================== I wonder if all those cases are subject to the same "we can lose those records" just like PRUNE. I haven't expected to see e.g. the BRIN-related records, but I'm more skeptical about cases with multiple WAL records. Because how exactly we know we don't lose just some of them? Those might go to two different WAL pages, and we manage to flush just one of them? What happens if we keep the INPLACE but lose the INVALIDATIONS message right after it? I'd bet that'll confuse the hell out of logical decoding, for example. [1] https://www.postgresql.org/message-id/9fb080d5-f509-cca4-1353-fd9da85db1d2%40enterprisedb.com [2] https://drive.google.com/drive/folders/1NEjWCG0uCWkrxrp_YZQOzqDfHlfJI8_l?usp=sharing -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
pgsql-hackers by date: