BUG #13725: Logical Decoding - wrong results with large transactions and unfortunate timing - Mailing list pgsql-bugs

From ofir.manor@gmail.com
Subject BUG #13725: Logical Decoding - wrong results with large transactions and unfortunate timing
Date
Msg-id 20151025191436.3017.36314@wrigleys.postgresql.org
Whole thread Raw
Responses Re: BUG #13725: Logical Decoding - wrong results with large transactions and unfortunate timing
List pgsql-bugs
The following bug has been logged on the website:

Bug reference:      13725
Logged by:          Ofir Manor
Email address:      ofir.manor@gmail.com
PostgreSQL version: 9.4.5
Operating system:   CentOS 7.1 64-bit
Description:

Nasty but easily reproducible.

Test case, script 1.
I clean up the previous test, create a new table and two replication slot.
Then, I run a script in the background (it will trigger the bug), insert
100K rows, try to get them with pg_logical_slot_get_changes from the first
slot.
This works correctly every time - I get back the 100,000 rows + one BEGIN +
one COMMIT.
psql --quiet  << EOF
SELECT pg_drop_replication_slot('test1_slot');
SELECT pg_drop_replication_slot('test2_slot');
DROP TABLE test;
CREATE TABLE test (id int primary key, v varchar);
SELECT * FROM pg_create_logical_replication_slot('test1_slot',
'test_decoding');
SELECT * FROM pg_create_logical_replication_slot('test2_slot',
'test_decoding');
EOF
./get_loop.sh &
psql --quiet --tuples-only -o out1  << EOF
INSERT INTO  test SELECT i, (i*3)::varchar FROM generate_series(1,100000)
i;
SELECT * FROM pg_logical_slot_get_changes('test1_slot', NULL, NULL);
EOF
cat --squeeze-blank out1 | wc -l

Script 2 (get_loop.sh) continuously polls for changes using a loop of
pg_logical_slot_get_changes, using the second replication slot. Since the
test pg database is idle, only one call returns.

for i in `seq 1 10000`; do
  echo "SELECT * FROM pg_logical_slot_get_changes('test2_slot', NULL,
NULL);"
done | psql --quiet --tuples-only -o out2
cat --squeeze-blank out2 | wc -l

However, I get 116K or 148K rows in the output (after the 100K inserts).
I can clearly see the the output jumps back after 16K or 48K rows - starting
to stream the transaction from the beginning (see the first column - the
offsets):
...
 1/68F6E1B8 | 450854 | table public.test: INSERT: id[integer]:49149
v[character varying]:'147447'
 1/68F6E248 | 450854 | table public.test: INSERT: id[integer]:49150
v[character varying]:'147450'
 1/68F6E2D8 | 450854 | table public.test: INSERT: id[integer]:49151
v[character varying]:'147453'
 1/68F6E368 | 450854 | table public.test: INSERT: id[integer]:49152
v[character varying]:'147456'
 1/68891010 | 450854 | table public.test: INSERT: id[integer]:1 v[character
varying]:'3'
 1/688910D8 | 450854 | table public.test: INSERT: id[integer]:2 v[character
varying]:'6'
 1/68891168 | 450854 | table public.test: INSERT: id[integer]:3 v[character
varying]:'9'
 1/688911F8 | 450854 | table public.test: INSERT: id[integer]:4 v[character
varying]:'12'
...

When I have larger transaction (inserting 1M or 10M rows, it gets worse,
jumping back to the beginning multiple times.

BTW - I don't know if it is a bug in pg_logical_slot_get_changes or in the
test_decoding output plugin that I'm using to evaluate logical decoding.

Any fix / workaround?

Thanks in advance,
   Ofir

pgsql-bugs by date:

Previous
From: zkronion@gmail.com
Date:
Subject: BUG #13727: pg_dump with a connection string fails when performing parallel backup
Next
From: "Shulgin, Oleksandr"
Date:
Subject: Re: BUG #13725: Logical Decoding - wrong results with large transactions and unfortunate timing