[HACKERS] Issues with logical replication - Mailing list pgsql-hackers

From Konstantin Knizhnik
Subject [HACKERS] Issues with logical replication
Date
Msg-id ab31b954-34b8-763b-686d-5d1e7960b458@postgrespro.ru
Whole thread Raw
Responses Re: [HACKERS] Issues with logical replication
List pgsql-hackers
I have faced two issues with logical replication.
Reproducing scenario:

1. start two Postgres instances (I start both at local machine).
2. Initialize pgbench tables at both instances:    pgbench -i -s 10 postgres
3. Create publication of pgbench_accounts table at one node:    create publication pub for table pgbench_accounts;
4. Create subscription at another node:    delete from pgbench_accounts;    CREATE SUBSCRIPTION sub connection
'dbname=postgreshost=localhost 
 
port=5432 sslmode=disable' publication pub;    CREATE OR REPLACE FUNCTION replication_trigger_proc() RETURNS 
TRIGGER AS $$    BEGIN    --  RETURN NEW;    END $$ LANGUAGE plpgsql;   CREATE TRIGGER replication_trigger BEFORE
INSERTOR UPDATE OR DELETE 
 
ON pgbench_accounts FOR EACH ROW EXECUTE PROCEDURE 
replication_trigger_proc();   ALTER TABLE pgbench_accounts ENABLE ALWAYS TRIGGER replication_trigger;
5. Start pgbench at primary node:    pgbench -T 1000 -P 2 -c 10 postgres


Please notice commented "return new" statement. Invocation of this 
function cause error and in log we see repeated messages:

2017-10-02 16:47:46.764 MSK [32129] LOG:  logical replication table 
synchronization worker for subscription "sub", table "pgbench_accounts" 
has started
2017-10-02 16:47:46.771 MSK [32129] ERROR:  control reached end of 
trigger procedure without RETURN
2017-10-02 16:47:46.771 MSK [32129] CONTEXT:  PL/pgSQL function 
replication_trigger_proc()    COPY pgbench_accounts, line 1: "1    1    0 "
2017-10-02 16:47:46.772 MSK [28020] LOG:  worker process: logical 
replication worker for subscription 17264 sync 17251 (PID 32129) exited 
with exit code 1
...


After few minutes of work primary node (with publication) is crashed 
with the following stack trace:

Program terminated with signal SIGABRT, Aborted.
#0  0x00007f3608f8ec37 in __GI_raise (sig=sig@entry=6) at 
../nptl/sysdeps/unix/sysv/linux/raise.c:56
56    ../nptl/sysdeps/unix/sysv/linux/raise.c: No such file or directory.
(gdb) bt
#0  0x00007f3608f8ec37 in __GI_raise (sig=sig@entry=6) at 
../nptl/sysdeps/unix/sysv/linux/raise.c:56
#1  0x00007f3608f92028 in __GI_abort () at abort.c:89
#2  0x00000000009f5740 in ExceptionalCondition (conditionName=0xbf6b30 
"!(((xid) != ((TransactionId) 0)))",    errorType=0xbf69af "FailedAssertion", fileName=0xbf69a8 "lmgr.c", 
lineNumber=582) at assert.c:54
#3  0x000000000086ac1d in XactLockTableWait (xid=0, rel=0x0, ctid=0x0, 
oper=XLTW_None) at lmgr.c:582
#4  0x000000000081c9f7 in SnapBuildWaitSnapshot (running=0x2749198, 
cutoff=898498) at snapbuild.c:1400
#5  0x000000000081c7c7 in SnapBuildFindSnapshot (builder=0x2807910, 
lsn=798477760, running=0x2749198) at snapbuild.c:1311
#6  0x000000000081c339 in SnapBuildProcessRunningXacts 
(builder=0x2807910, lsn=798477760, running=0x2749198) at snapbuild.c:1106
#7  0x000000000080a1c7 in DecodeStandbyOp (ctx=0x27ef870, 
buf=0x7ffd301858d0) at decode.c:314
#8  0x0000000000809ce1 in LogicalDecodingProcessRecord (ctx=0x27ef870, 
record=0x27efb30) at decode.c:117
#9  0x000000000080ddf9 in DecodingContextFindStartpoint (ctx=0x27ef870) 
at logical.c:458
#10 0x0000000000823968 in CreateReplicationSlot (cmd=0x27483a8) at 
walsender.c:934
#11 0x00000000008246ee in exec_replication_command (    cmd_string=0x27b9520 "CREATE_REPLICATION_SLOT 
\"sub_17264_sync_17251\" TEMPORARY LOGICAL pgoutput USE_SNAPSHOT") at 
walsender.c:1511
#12 0x000000000088eb44 in PostgresMain (argc=1, argv=0x275b738, 
dbname=0x275b578 "postgres", username=0x272b800 "knizhnik")    at postgres.c:4086
#13 0x00000000007ef9a1 in BackendRun (port=0x27532a0) at postmaster.c:4357
#14 0x00000000007ef0cb in BackendStartup (port=0x27532a0) at 
postmaster.c:4029
#15 0x00000000007eb68b in ServerLoop () at postmaster.c:1753
#16 0x00000000007eac77 in PostmasterMain (argc=3, argv=0x2729670) at 
postmaster.c:1361
#17 0x0000000000728552 in main (argc=3, argv=0x2729670) at main.c:228


Now fix the trigger function:
CREATE OR REPLACE FUNCTION replication_trigger_proc() RETURNS TRIGGER AS $$
BEGIN  RETURN NEW;
END $$ LANGUAGE plpgsql;

And manually perform at master two updates inside one transaction:

postgres=# begin;
BEGIN
postgres=# update pgbench_accounts set abalance=abalance+1 where aid=26;
UPDATE 1
postgres=# update pgbench_accounts set abalance=abalance-1 where aid=26;
UPDATE 1
postgres=# commit;
<hangs>

and in replica log we see:
2017-10-02 18:40:26.094 MSK [2954] LOG:  logical replication apply 
worker for subscription "sub" has started
2017-10-02 18:40:26.101 MSK [2954] ERROR:  attempted to lock invisible 
tuple
2017-10-02 18:40:26.102 MSK [2882] LOG:  worker process: logical 
replication worker for subscription 16403 (PID 2954) exited with exit 
code 1

Error happens in trigger.c:

#3  0x000000000069bddb in GetTupleForTrigger (estate=0x2e36b50, 
epqstate=0x7ffc4420eda0, relinfo=0x2dcfe90, tid=0x2dd08ac,    lockmode=LockTupleNoKeyExclusive, newSlot=0x7ffc4420ec40)
at
 
trigger.c:3103
#4  0x000000000069b259 in ExecBRUpdateTriggers (estate=0x2e36b50, 
epqstate=0x7ffc4420eda0, relinfo=0x2dcfe90, tupleid=0x2dd08ac,    fdw_trigtuple=0x0, slot=0x2dd0240) at trigger.c:2748
#5  0x00000000006d2395 in ExecSimpleRelationUpdate (estate=0x2e36b50, 
epqstate=0x7ffc4420eda0, searchslot=0x2dd0358, slot=0x2dd0240)    at execReplication.c:461
#6  0x0000000000820894 in apply_handle_update (s=0x7ffc442163b0) at 
worker.c:736
#7  0x0000000000820d83 in apply_dispatch (s=0x7ffc442163b0) at worker.c:892
#8  0x000000000082121b in LogicalRepApplyLoop (last_received=2225693736) 
at worker.c:1095
#9  0x000000000082219d in ApplyWorkerMain (main_arg=0) at worker.c:1647
#10 0x00000000007dd496 in StartBackgroundWorker () at bgworker.c:835
#11 0x00000000007f0889 in do_start_bgworker (rw=0x2d75f50) at 
postmaster.c:5680
#12 0x00000000007f0bc3 in maybe_start_bgworkers () at postmaster.c:5884
#13 0x00000000007efbeb in sigusr1_handler (postgres_signal_arg=10) at 
postmaster.c:5073
#14 <signal handler called>
#15 0x00007fbe83a054a3 in __select_nocancel () at 
../sysdeps/unix/syscall-template.S:81
#16 0x00000000007eb517 in ServerLoop () at postmaster.c:1717
#17 0x00000000007eac48 in PostmasterMain (argc=3, argv=0x2d4e660) at 
postmaster.c:1361
#18 0x0000000000728523 in main (argc=3, argv=0x2d4e660) at main.c:228

-- 
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

pgsql-hackers by date:

Previous
From: Daniel Gustafsson
Date:
Subject: [HACKERS] Commitfest 201709 is now closed
Next
From: Brent Dearth
Date:
Subject: [HACKERS] Horrible CREATE DATABASE Performance in High Sierra