Re: sequences vs. synchronous replication - Mailing list pgsql-hackers

From Tomas Vondra
Subject Re: sequences vs. synchronous replication
Date
Msg-id 0f827a71-a01b-bcf9-fe77-3047a9d4a93c@enterprisedb.com
Whole thread Raw
In response to Re: sequences vs. synchronous replication  (Tomas Vondra <tomas.vondra@enterprisedb.com>)
List pgsql-hackers
On 12/24/21 11:40, Tomas Vondra wrote:
> 
> ...
> 
> FWIW I plan to explore the idea of looking at sequence page LSN, and
> flushing up to that position.
> 

So, I explored the page LSN idea, and it seems to be working pretty
nicely. There still is some impact on the workload doing just nextval
calls, but it's much better than WAL-logging everything. The patch is
available at [1].

While working on that patch, I realized this actually affects even
systems without any replicas - it's trivial to get a sequence going
backwards. Imagine you do this:

  BEGIN;
  SELECT nextval('s') FROM generate_series(1,50) s(i);
  ROLLBACK;

  SELECT nextval('s');

  -- kill -9 postgres

It's pretty likely a nextval('s') after restarting the server returns a
value from before the last nextval('s'), in case we do not manage to
flush the WAL before the kill.

The patch deals with this by updating XactLastRecEnd and then flushing
up to that point in RecordTransactionCommit(). But for that to happen,
we have to do the flush/wait even without a valid XID (which we may not
have when nextval gets called outside a transaction).

So I was wondering what other places do the same thing (generates WAL
without setting a XID), because that might either have similar issues
with not flushing data, and/or be affected by this change.

RecordTransactionCommit() says about such cases this:

  /*
   * Check if we want to commit asynchronously.  We can allow the
   * XLOG flush to happen asynchronously if synchronous_commit=off,
   * or if the current transaction has not performed any WAL-logged
   * operation or didn't assign an xid.  The transaction can end up
   * not writing any WAL, even if it has an xid, if it only wrote to
   * temporary and/or unlogged tables.  It can end up having written
   * WAL without an xid if it did HOT pruning.  In case of a crash,
   * the loss of such a transaction will be irrelevant; temp tables
   * will be lost anyway, unlogged tables will be truncated and HOT
   * pruning will be done again later. (Given the foregoing, you
   * might think that it would be unnecessary to emit the XLOG record
   * at all in this case, but we don't currently try to do that.  It
   * would certainly cause problems at least in Hot Standby mode,
   * where the KnownAssignedXids machinery requires tracking every
   * XID assignment.  It might be OK to skip it only when wal_level <
   * replica, but for now we don't.)
   *
   * However, if we're doing cleanup of any non-temp rels or
   * committing any command that wanted to force sync commit, then we
   * must flush XLOG immediately.  (We must not allow asynchronous
   * commit if there are any non-temp tables to be deleted, because
   * we might delete the files before the COMMIT record is flushed to
   * disk.  We do allow asynchronous commit if all to-be-deleted
   * tables are temporary though, since they are lost anyway if we
   * crash.)
   */

Note: This relates only to XLogFlush vs. XLogSetAsyncXactLSN, not about
waiting for sync standby. For that we ignore forceSyncCommit, which
seems a bit weird ...

Anyway, I was wondering what happens in practice, so I added very simple
logging to RecordTransactionCommit():

    if (wrote_log && !markXidCommitted)
        elog(WARNING, "not flushing at %X/%X",
             (uint32) (XactLastRecEnd >> 32),
             (uint32) XactLastRecEnd);

and then ran installcheck, which produces ~700 messages. Looking at the
WAL (last few records before the LSN reported by the log message), most
of this is related to HOT pruning (i.e. PRUNE), but there's plenty of
other WAL records. And I'm not sure if it's OK to just lose (some of)
those messages, as the comment claims for PRUNE.

It's quite possible I miss something basic/important, and everything is
fine and dandy, but here's a couple non-pruning examples - command
triggering the log message, along with the last few WAL records without
XID assigned right before RecordTransactionCommit() was called.

A more complete data set (full WAL dump, regression.diffs etc.) is
available at [2].

========================================================================

VACUUM ANALYZE num_exp_add;
---------------------------
VISIBLE cutoff xid 37114 flags 0x01, blkref #0: rel 1663/63341/3697 ...
INPLACE off 39, blkref #0: rel 1663/63341/1259 blk 5


SELECT proname, provolatile FROM pg_proc
       WHERE oid in ('functest_B_1'::regproc,
                     'functest_B_2'::regproc,
                     'functest_B_3'::regproc,
             'functest_B_4'::regproc) ORDER BY proname;
------------------------------------------------
VACUUM nunused 223, blkref #0: rel 1663/63341/2608 blk 39
VISIBLE cutoff xid 39928 flags 0x01, blkref #0: rel 1663/63341/2608 ...
VACUUM nunused 6, blkref #0: rel 1663/63341/2608 blk 40
META_CLEANUP last_cleanup_num_delpages 5, blkref #0: rel 1663/63341 ...
META_CLEANUP last_cleanup_num_delpages 1, blkref #0: rel 1663/63341 ...
INPLACE off 13, blkref #0: rel 1663/63341/1259 blk 4
INPLACE off 14, blkref #0: rel 1663/63341/1259 blk 4
INPLACE off 20, blkref #0: rel 1663/63341/1259 blk 8
INVALIDATIONS ; inval msgs: catcache 53 catcache 52 catcache 53 ...


EXPLAIN (COSTS OFF)
SELECT t1.a, t1.c, t2.a, t2.c FROM plt1_adv t1 INNER JOIN plt2_adv
t2 ON (t1.a = t2.a AND t1.c = t2.c) WHERE t1.c IN ('0003', '0004',
'0005') AND t1.b < 10 ORDER BY t1.a;
------------------------------------------------
INPLACE off 11, blkref #0: rel 1663/63341/69386 blk 67
INVALIDATIONS ; inval msgs: catcache 53 catcache 52 relcache 82777


VACUUM FREEZE indtoasttest;
---------------------------
FREEZE_PAGE cutoff xid 47817 ntuples 4, blkref #0: rel 1663/63341 ...
VISIBLE cutoff xid 47816 flags 0x03, blkref #0: rel 1663/63341/ ...
INPLACE off 91, blkref #0: rel 1663/63341/69386 blk 37


SELECT brin_summarize_range('brin_summarize_multi_idx', 2);
-----------------------------------------------------------
INSERT heapBlk 2 pagesPerRange 2 offnum 2, blkref #0: rel 1663/63 ...
SAMEPAGE_UPDATE offnum 2, blkref #0: rel 1663/63341/73957 blk 2


SELECT brin_desummarize_range('brinidx_multi', 0);
---------------------------------------------------
DESUMMARIZE pagesPerRange 1, heapBlk 0, page offset 9, blkref #0: ...


select gin_clean_pending_list('gin_test_idx')>10 as many;
------------------------------------------------------------------------
DELETE_LISTPAGE ndeleted: 16, blkref #0: rel 1663/63341/71933 blk ...
DELETE_LISTPAGE ndeleted: 16, blkref #0: rel 1663/63341/71933 blk ...
DELETE_LISTPAGE ndeleted: 11, blkref #0: rel 1663/63341/71933 blk ...


VACUUM no_index_cleanup;
------------------------------------------------------------------------
META_CLEANUP last_cleanup_num_delpages 0, blkref #0: rel 1663/63341 ...

========================================================================

I wonder if all those cases are subject to the same "we can lose those
records" just like PRUNE. I haven't expected to see e.g. the
BRIN-related records, but I'm more skeptical about cases with multiple
WAL records. Because how exactly we know we don't lose just some of
them? Those might go to two different WAL pages, and we manage to flush
just one of them? What happens if we keep the INPLACE but lose the
INVALIDATIONS message right after it? I'd bet that'll confuse the hell
out of logical decoding, for example.


[1]
https://www.postgresql.org/message-id/9fb080d5-f509-cca4-1353-fd9da85db1d2%40enterprisedb.com

[2]
https://drive.google.com/drive/folders/1NEjWCG0uCWkrxrp_YZQOzqDfHlfJI8_l?usp=sharing

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



pgsql-hackers by date:

Previous
From: Chris Cleveland
Date:
Subject: More data files / forks
Next
From: Alvaro Herrera
Date:
Subject: Re: Column Filtering in Logical Replication