Thread: BUG #16764: wal sender exits abnormally which cause wal receiever exits twice

BUG #16764: wal sender exits abnormally which cause wal receiever exits twice

From
PG Bug reporting form
Date:
06 December 2020, 10:12:16
The following bug has been logged on the website:

Bug reference:      16764
Logged by:          Bo Chen
Email address:      bchen90@163.com
PostgreSQL version: 11.8
Operating system:   Suse linux 11.3 x86_64
Description:

Hi，hackers

For hot standby mode, I constructed the following operations，kills the
walsender process while inserting some data to the test table. I found the
walreceiver process exits twice abnormally , obviously first exiting is
because the exit of the sender，but when the wal replication stream
reestablished，wal walreceiver killed by startup because invaild xlog
record.

I analysis the problem, it is found that the xlog record of the problem
location(across page record) changes before and after the sender exits,  and
the startup process of the standby has already loaded the previous page of
the changing record into memory before the exit of the receiver. As a
result, invalid xlog record is founded and the receiver process killed by
strartup. 

ps: I test 11.5 and the problem is still exists, I believe this problem has
been around for a long time.


Is tihis problem normally?

I'm worried about this problem could cause standby data damage in extreme
scenarios(for the same contrecord length and crc, but not the same xlog
record)


insert data procedure：
create or replace procedure test_procedure(in_count in integer)
as
$$
declare
    i_count integer;
begin
FOR i_count IN 1..in_count LOOP
    INSERT INTO test_table

VALUES(i_count,'1234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890');
end loop;
delete from test_table;
end;
$$ language plpgsql;


standby database runing log:

2020-11-27 12:14:13.288 CGP [38116] DEBUG:  sendtime 2020-11-27
12:14:13.287768+00 receipttime 2020-11-27 12:14:13.288081+00 replication
apply delay 26585 ms transfer latency 0 ms
2020-11-27 12:14:13.288 CGP [38116] DEBUG:  sending write 1/B1FB8E0 flush
1/B1F4FA0 apply 1/B1F4EF8
2020-11-27 12:14:13.288 CGP [38116] DEBUG:  sending write 1/B1FB8E0 flush
1/B1FB8E0 apply 1/B1F4EF8
2020-11-27 12:14:13.289 CGP [7804] LOG:  invalid contrecord length 152 at
1/B1F4EF8
2020-11-27 12:14:13.289 CGP [38116] FATAL:  terminating walreceiver process
due to administrator command
2020-11-27 12:14:13.289 CGP [38116] DEBUG:  shmem_exit(1): 1
before_shmem_exit callbacks to make
2020-11-27 12:14:13.289 CGP [38116] DEBUG:  shmem_exit(1): 5 on_shmem_exit
callbacks to make
2020-11-27 12:14:13.289 CGP [38116] DEBUG:  proc_exit(1): 2 callbacks to
make
2020-11-27 12:14:13.289 CGP [38116] DEBUG:  exit(1)
2020-11-27 12:14:13.289 CGP [38116] DEBUG:  shmem_exit(-1): 0
before_shmem_exit callbacks to make
2020-11-27 12:14:13.289 CGP [38116] DEBUG:  shmem_exit(-1): 0 on_shmem_exit
callbacks to make
2020-11-27 12:14:13.289 CGP [38116] DEBUG:  proc_exit(-1): 0 callbacks to
make
2020-11-27 12:14:13.289 CGP [7804] DEBUG:  switched WAL source from stream
to archive after failure
2020-11-27 12:14:13.289 CGP [7804] DEBUG:  transaction ID wrap limit is
2147484208, limited by database with OID 1
2020-11-27 12:14:13.289 CGP [7804] CONTEXT:  WAL redo at 1/B1F4EF8 for
XLOG/CHECKPOINT_SHUTDOWN: redo 1/B1F4EF8; tli 1; prev tli 1; fpw true; xid
0:99176; oid 24585; multi 1; offset 0; oldest xid 561 in DB 1; oldest multi
1 in DB 1; oldest/newest commit timestamp xid: 0/0; oldest running xid 0;
shutdown