Re: Server Crash while executing pg_replication_slot_advance (second time) - Mailing list pgsql-hackers
From | Arseny Sher |
---|---|
Subject | Re: Server Crash while executing pg_replication_slot_advance (second time) |
Date | |
Msg-id | 873720e4hf.fsf@ars-thinkpad Whole thread Raw |
In response to | Server Crash while executing pg_replication_slot_advance (secondtime) (tushar <tushar.ahuja@enterprisedb.com>) |
Responses |
Re: Server Crash while executing pg_replication_slot_advance (second time)
|
List | pgsql-hackers |
Hello, I confirm this bug. The idea is that while usually we start decoding from safe data.restart_lsn point, here we don't care about consistent snapshots and rush into decoding right away from data.confirmed_flush (slotfunc.c:475). The latter points to the first page's header instead of valid record if we log SWITCH record and confirm it without any additional records (while doing this manually you'd better hurry up to outrun xl_running_xacts slipping through). This can be reproduced with a bit simpler sample: SELECT * FROM pg_create_logical_replication_slot( 'regression_slot1', 'test_decoding', true); select pg_switch_wal(); -- ensure we are on clean segment and xl_running_xacts didn't slip yet select pg_current_wal_lsn(); pg_current_wal_lsn -------------------- 0/2000000 (1 row) -- set confirmed_flush_lsn to segment start and verify that select pg_replication_slot_advance('regression_slot1', '0/6000000'); select slot_name, xmin, catalog_xmin, restart_lsn, confirmed_flush_lsn from pg_replication_slots; slot_name | xmin | catalog_xmin | restart_lsn | confirmed_flush_lsn ------------------+------+--------------+-------------+--------------------- regression_slot1 | | 557 | 0/15D0EE8 | 0/2000000 (1 row) -- insert some wal to advance GetFlushRecPtr create table t (i int); -- now start decoding from start of segment, boom select pg_replication_slot_advance('regression_slot1', '0/6000000'); server closed the connection unexpectedly This probably means the server terminated abnormally before or while processing the request. The connection to the server was lost. Attempting reset: Failed. : @:!> : @:!> : @:!> \q Simple way to fix that is to start decoding traditionally from restart_lsn which certainly points to valid record. Attached patch shows the idea. However, I am not sure this is a right decision: decoding from restart_lsn is slower since it lags behind confirmed_lsn; and we really don't care about consistent snapshots in this function. Well, it would be good if we assemble one on our way, serialize it and advance restart_lsn -- and this is AFAIU the main reason we ever decode something at all instead of just calling LogicalConfirmReceivedLocation -- but that's not the main goal of the function. Another approach is to notice pointer to page start and replace it with ptr to first record, but that doesn't sound elegantly. -- Arseny Sher Postgres Professional: http://www.postgrespro.com The Russian Postgres Company From d3db37bca231beb0081567ef818ac1ec852cbb1a Mon Sep 17 00:00:00 2001 From: Arseny Sher <sher-ars@yandex.ru> Date: Sat, 17 Feb 2018 04:39:09 +0300 Subject: [PATCH] Start decoding from restart_lsn which definitely exists. --- src/backend/replication/slotfuncs.c | 5 ++++- 1 file changed, 4 insertions(+), 1 deletion(-) diff --git a/src/backend/replication/slotfuncs.c b/src/backend/replication/slotfuncs.c index 7302f36488..e7d0ef49bc 100644 --- a/src/backend/replication/slotfuncs.c +++ b/src/backend/replication/slotfuncs.c @@ -472,7 +472,10 @@ pg_replication_slot_advance(PG_FUNCTION_ARGS) /* Acquire the slot so we "own" it */ ReplicationSlotAcquire(NameStr(*slotname), true); - startlsn = MyReplicationSlot->data.confirmed_flush; + if (OidIsValid(MyReplicationSlot->data.database)) + startlsn = MyReplicationSlot->data.restart_lsn; + else + startlsn = MyReplicationSlot->data.confirmed_flush; if (moveto < startlsn) { ReplicationSlotRelease(); -- 2.11.0
pgsql-hackers by date: