BUG #16517: Fast Shutdown sometimes is not working if we use logical decoding. - Mailing list pgsql-bugs

From PG Bug reporting form
Subject BUG #16517: Fast Shutdown sometimes is not working if we use logical decoding.
Date
Msg-id 16517-db20030710e402fa@postgresql.org
Whole thread Raw
List pgsql-bugs
The following bug has been logged on the website:

Bug reference:      16517
Logged by:          Евтушок Максим
Email address:      max.evtushok@gmail.com
PostgreSQL version: 11.8
Operating system:   Linux
Description:

I use a logical decoding feature (pgoutput plugin on the sender's side and
pgjdbc with Debezium on the receiver's side).

After insert + commit I wait for some time, watching in Postgres debug log
for values of write+flush+apply LSN positions (walsender.c).

Write position of walsender continues to grow because new data arrives into
WAL. But flush and apply positions freeze because no new data transferred to
the receiver.

It blocks fast shutdown of Postgres because there is a check that the write
position should be equal to the flush position in WalSndDone procedure
(https://github.com/postgres/postgres/blob/REL_11_STABLE/src/backend/replication/walsender.c#L2852)

I think that the write position should not be incremented if no data sent to
a receiver.

Example:

Postgres log some time after insert:
    2020-06-26 10:28:26.526 UTC [767983] PostgreSQL JDBC Driver
185.193.196.40(2838) DEBUG:  write 0/1B1959E0 flush 0/1B1958C8 apply
0/1B1958C8


Waldump:
    /usr/local/pgsql/data$ pg_waldump -s 0/1B1935C8
    rmgr: Sequence    len (rec/tot):     99/    99, tx:     752659, lsn:
0/1B1935C8, prev 0/1B193590, desc: LOG rel 1663/16384/32801, blkref #0: rel
1663/16384/32801 blk 0
    rmgr: Heap        len (rec/tot):     75/   583, tx:     752659, lsn:
0/1B193630, prev 0/1B1935C8, desc: INSERT off 11, blkref #0: rel
1663/16384/32803 blk 2119 FPW
    rmgr: Btree       len (rec/tot):     53/  8193, tx:     752659, lsn:
0/1B193878, prev 0/1B193630, desc: INSERT_LEAF off 405, blkref #0: rel
1663/16384/32807 blk 1076 FPW
    rmgr: Transaction len (rec/tot):     46/    46, tx:     752659, lsn:
0/1B195898, prev 0/1B193878, desc: COMMIT 2020-06-26 10:23:59.694483 UTC
    rmgr: Standby     len (rec/tot):     50/    50, tx:          0, lsn:
0/1B1958C8, prev 0/1B195898, desc: RUNNING_XACTS nextXid 752660
latestCompletedXid 752659 oldestRunningXid 752660
    rmgr: Standby     len (rec/tot):     50/    50, tx:          0, lsn:
0/1B195900, prev 0/1B1958C8, desc: RUNNING_XACTS nextXid 752660
latestCompletedXid 752659 oldestRunningXid 752660
    rmgr: XLOG        len (rec/tot):    106/   106, tx:          0, lsn:
0/1B195938, prev 0/1B195900, desc: CHECKPOINT_ONLINE redo 0/1B195900; tli 1;
prev tli 1; fpw true; xid 0:752660; oid 32811; multi 1; offset 0; oldest xid
561 in DB 1; oldest multi 1 in DB 1; oldest/newest commit timestamp xid:
0/0; oldest running xid 752660; online
    rmgr: Standby     len (rec/tot):     50/    50, tx:          0, lsn:
0/1B1959A8, prev 0/1B195938, desc: RUNNING_XACTS nextXid 752660
latestCompletedXid 752659 oldestRunningXid 752660
    pg_waldump: FATAL:  error in WAL record at 0/1B1959A8: invalid record
length at 0/1B1959E0: wanted 24, got 0


Last message on network with WAL data is a commit message (XLogData (B)
https://www.postgresql.org/docs/11/protocol-replication.html):
    0040         64 00 00 00 37 77 00 00 00 00 1b 19 58 c8
d...7w......XÈ
    0050   00 00 00 00 1b 19 58 c8 00 02 4b f8 e0 40 2d ab
......XÈ..Køà@-«
    0060   43 00 00 00 00 00 1b 19 58 98 00 00 00 00 1b 19
C.......X.......
    0070   58 c8 00 02 4b f8 e0 40 26 93                     XÈ..Køà@&.
Write position here - 0/1B1958C8


No new XLogData arrived, but keepalive LSN keep incrementing after the last
commit message:
    0040         64 00 00 00 16 6b 00 00 00 00 1b 19 59 e0
d....k......Yà
    0050   00 02 4b f8 ef 22 6c d6 00                        ..Køï"lÖ.
Write position here - 0/1B1959E0

So there are increments of LSN not visible to receiver that couldn't be
flushed. 

Because of that there is big probability that during fast shutdown keep
alive will have bigger LSN in keep-alive than flushed LSN on a receiver and
fast shutdown won't be possible.


pgsql-bugs by date:

Previous
From: PG Bug reporting form
Date:
Subject: BUG #16516: when testing jit get terminate called after throwing an instance of 'std::bad_function_call'
Next
From: Hamid Akhtar
Date:
Subject: Re: BUG #16516: when testing jit get terminate called after throwing an instance of 'std::bad_function_call'