BUG #16517: Fast Shutdown sometimes is not working if we use logical decoding. - Mailing list pgsql-bugs
From | PG Bug reporting form |
---|---|
Subject | BUG #16517: Fast Shutdown sometimes is not working if we use logical decoding. |
Date | |
Msg-id | 16517-db20030710e402fa@postgresql.org Whole thread Raw |
List | pgsql-bugs |
The following bug has been logged on the website: Bug reference: 16517 Logged by: Евтушок Максим Email address: max.evtushok@gmail.com PostgreSQL version: 11.8 Operating system: Linux Description: I use a logical decoding feature (pgoutput plugin on the sender's side and pgjdbc with Debezium on the receiver's side). After insert + commit I wait for some time, watching in Postgres debug log for values of write+flush+apply LSN positions (walsender.c). Write position of walsender continues to grow because new data arrives into WAL. But flush and apply positions freeze because no new data transferred to the receiver. It blocks fast shutdown of Postgres because there is a check that the write position should be equal to the flush position in WalSndDone procedure (https://github.com/postgres/postgres/blob/REL_11_STABLE/src/backend/replication/walsender.c#L2852) I think that the write position should not be incremented if no data sent to a receiver. Example: Postgres log some time after insert: 2020-06-26 10:28:26.526 UTC [767983] PostgreSQL JDBC Driver 185.193.196.40(2838) DEBUG: write 0/1B1959E0 flush 0/1B1958C8 apply 0/1B1958C8 Waldump: /usr/local/pgsql/data$ pg_waldump -s 0/1B1935C8 rmgr: Sequence len (rec/tot): 99/ 99, tx: 752659, lsn: 0/1B1935C8, prev 0/1B193590, desc: LOG rel 1663/16384/32801, blkref #0: rel 1663/16384/32801 blk 0 rmgr: Heap len (rec/tot): 75/ 583, tx: 752659, lsn: 0/1B193630, prev 0/1B1935C8, desc: INSERT off 11, blkref #0: rel 1663/16384/32803 blk 2119 FPW rmgr: Btree len (rec/tot): 53/ 8193, tx: 752659, lsn: 0/1B193878, prev 0/1B193630, desc: INSERT_LEAF off 405, blkref #0: rel 1663/16384/32807 blk 1076 FPW rmgr: Transaction len (rec/tot): 46/ 46, tx: 752659, lsn: 0/1B195898, prev 0/1B193878, desc: COMMIT 2020-06-26 10:23:59.694483 UTC rmgr: Standby len (rec/tot): 50/ 50, tx: 0, lsn: 0/1B1958C8, prev 0/1B195898, desc: RUNNING_XACTS nextXid 752660 latestCompletedXid 752659 oldestRunningXid 752660 rmgr: Standby len (rec/tot): 50/ 50, tx: 0, lsn: 0/1B195900, prev 0/1B1958C8, desc: RUNNING_XACTS nextXid 752660 latestCompletedXid 752659 oldestRunningXid 752660 rmgr: XLOG len (rec/tot): 106/ 106, tx: 0, lsn: 0/1B195938, prev 0/1B195900, desc: CHECKPOINT_ONLINE redo 0/1B195900; tli 1; prev tli 1; fpw true; xid 0:752660; oid 32811; multi 1; offset 0; oldest xid 561 in DB 1; oldest multi 1 in DB 1; oldest/newest commit timestamp xid: 0/0; oldest running xid 752660; online rmgr: Standby len (rec/tot): 50/ 50, tx: 0, lsn: 0/1B1959A8, prev 0/1B195938, desc: RUNNING_XACTS nextXid 752660 latestCompletedXid 752659 oldestRunningXid 752660 pg_waldump: FATAL: error in WAL record at 0/1B1959A8: invalid record length at 0/1B1959E0: wanted 24, got 0 Last message on network with WAL data is a commit message (XLogData (B) https://www.postgresql.org/docs/11/protocol-replication.html): 0040 64 00 00 00 37 77 00 00 00 00 1b 19 58 c8 d...7w......XÈ 0050 00 00 00 00 1b 19 58 c8 00 02 4b f8 e0 40 2d ab ......XÈ..Køà@-« 0060 43 00 00 00 00 00 1b 19 58 98 00 00 00 00 1b 19 C.......X....... 0070 58 c8 00 02 4b f8 e0 40 26 93 XÈ..Køà@&. Write position here - 0/1B1958C8 No new XLogData arrived, but keepalive LSN keep incrementing after the last commit message: 0040 64 00 00 00 16 6b 00 00 00 00 1b 19 59 e0 d....k......Yà 0050 00 02 4b f8 ef 22 6c d6 00 ..Køï"lÖ. Write position here - 0/1B1959E0 So there are increments of LSN not visible to receiver that couldn't be flushed. Because of that there is big probability that during fast shutdown keep alive will have bigger LSN in keep-alive than flushed LSN on a receiver and fast shutdown won't be possible.
pgsql-bugs by date:
Previous
From: PG Bug reporting formDate:
Subject: BUG #16516: when testing jit get terminate called after throwing an instance of 'std::bad_function_call'
Next
From: Hamid AkhtarDate:
Subject: Re: BUG #16516: when testing jit get terminate called after throwing an instance of 'std::bad_function_call'