Thread: BUG #13725: Logical Decoding - wrong results with large transactions and unfortunate timing
BUG #13725: Logical Decoding - wrong results with large transactions and unfortunate timing
From
ofir.manor@gmail.com
Date:
The following bug has been logged on the website:

Bug reference:      13725
Logged by:          Ofir Manor
Email address:      ofir.manor@gmail.com
PostgreSQL version: 9.4.5
Operating system:   CentOS 7.1 64-bit
Description:

Nasty but easily reproducible.

Test case, script 1.
I clean up the previous test, create a new table and two replication slots.
Then, I run a script in the background (it will trigger the bug), insert
100K rows, and try to get them with pg_logical_slot_get_changes from the
first slot. This works correctly every time - I get back the 100,000 rows
+ one BEGIN + one COMMIT.

psql --quiet << EOF
SELECT pg_drop_replication_slot('test1_slot');
SELECT pg_drop_replication_slot('test2_slot');
DROP TABLE test;
CREATE TABLE test (id int primary key, v varchar);
SELECT * FROM pg_create_logical_replication_slot('test1_slot', 'test_decoding');
SELECT * FROM pg_create_logical_replication_slot('test2_slot', 'test_decoding');
EOF
./get_loop.sh &
psql --quiet --tuples-only -o out1 << EOF
INSERT INTO test SELECT i, (i*3)::varchar FROM generate_series(1,100000) i;
SELECT * FROM pg_logical_slot_get_changes('test1_slot', NULL, NULL);
EOF
cat --squeeze-blank out1 | wc -l

Script 2 (get_loop.sh) continuously polls for changes in a loop of
pg_logical_slot_get_changes calls, using the second replication slot.
Since the test pg database is otherwise idle, only one of those calls
returns any rows.

for i in `seq 1 10000`; do
  echo "SELECT * FROM pg_logical_slot_get_changes('test2_slot', NULL, NULL);"
done | psql --quiet --tuples-only -o out2
cat --squeeze-blank out2 | wc -l

However, I get 116K or 148K rows in the output (after the 100K inserts).
I can clearly see that the output jumps back after 16K or 48K rows,
starting to stream the transaction from the beginning again (see the
first column - the LSNs):

...
1/68F6E1B8 | 450854 | table public.test: INSERT: id[integer]:49149 v[character varying]:'147447'
1/68F6E248 | 450854 | table public.test: INSERT: id[integer]:49150 v[character varying]:'147450'
1/68F6E2D8 | 450854 | table public.test: INSERT: id[integer]:49151 v[character varying]:'147453'
1/68F6E368 | 450854 | table public.test: INSERT: id[integer]:49152 v[character varying]:'147456'
1/68891010 | 450854 | table public.test: INSERT: id[integer]:1 v[character varying]:'3'
1/688910D8 | 450854 | table public.test: INSERT: id[integer]:2 v[character varying]:'6'
1/68891168 | 450854 | table public.test: INSERT: id[integer]:3 v[character varying]:'9'
1/688911F8 | 450854 | table public.test: INSERT: id[integer]:4 v[character varying]:'12'
...

With a larger transaction (inserting 1M or 10M rows) it gets worse,
jumping back to the beginning multiple times.

BTW - I don't know if this is a bug in pg_logical_slot_get_changes or in
the test_decoding output plugin that I'm using to evaluate logical
decoding.

Any fix / workaround?

Thanks in advance,
Ofir
Re: BUG #13725: Logical Decoding - wrong results with large transactions and unfortunate timing
From
"Shulgin, Oleksandr"
Date:
On Sun, Oct 25, 2015 at 8:14 PM, <ofir.manor@gmail.com> wrote:

> The following bug has been logged on the website:
>
> Bug reference: 13725
>
> [snip]
>
> However, I get 116K or 148K rows in the output (after the 100K inserts).
> I can clearly see that the output jumps back after 16K or 48K rows,
> starting to stream the transaction from the beginning again.
>
> [snip]
>
> Any fix / workaround?
I didn't try it, but it looks like this part of code in logicalfuncs.c is
to blame:

static Datum
pg_logical_slot_get_changes_guts(FunctionCallInfo fcinfo, bool confirm, bool binary)
{
    [snip]

    /* compute the current end-of-wal */
    if (!RecoveryInProgress())
        end_of_wal = GetFlushRecPtr();
    else
        end_of_wal = GetXLogReplayRecPtr(NULL);

    [snip]

    while ((startptr != InvalidXLogRecPtr && startptr < end_of_wal) ||
           (ctx->reader->EndRecPtr && ctx->reader->EndRecPtr < end_of_wal))
    {

So we compute end_of_wal before entering the loop, but the WAL keeps
growing as we read through it.

If we do it correctly, there's potential that the loop will never finish
if the WAL grows faster than we can decode it. Shouldn't we also try to
re-write this SRF to support SFRM_ValuePerCall?

Another (quite minor) problem I see is this:

--- a/src/backend/replication/logical/logicalfuncs.c
+++ b/src/backend/replication/logical/logicalfuncs.c
@@ -301,7 +301,7 @@ pg_logical_slot_get_changes_guts(FunctionCallInfo fcinfo, bool confirm, bool bin
        upto_lsn = PG_GETARG_LSN(1);

        if (PG_ARGISNULL(2))
-               upto_nchanges = InvalidXLogRecPtr;
+               upto_nchanges = 0;
        else
                upto_nchanges = PG_GETARG_INT32(2);
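To illustrate what re-reading the horizon could look like, here's an
UNTESTED sketch (names taken from 9.4's logicalfuncs.c, I haven't even
compiled this) that recomputes end_of_wal on every iteration instead of
once before the loop:

for (;;)
{
    XLogRecPtr  end_of_wal;

    /* re-read the current end-of-wal each time around */
    if (!RecoveryInProgress())
        end_of_wal = GetFlushRecPtr();
    else
        end_of_wal = GetXLogReplayRecPtr(NULL);

    if (!((startptr != InvalidXLogRecPtr && startptr < end_of_wal) ||
          (ctx->reader->EndRecPtr != InvalidXLogRecPtr &&
           ctx->reader->EndRecPtr < end_of_wal)))
        break;

    /* ... existing loop body: read and decode the next record ... */
}

Of course that only widens the window within one call, which is exactly
the never-terminating-loop worry above.

--
Alex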
Re: BUG #13725: Logical Decoding - wrong results with large transactions and unfortunate timing
From
Andres Freund
Date:
On 2015-10-26 10:43:08 +0100, Shulgin, Oleksandr wrote:
> > Test case, script 1.
> >
> > [snip]
> >
> > Script 2 (get_loop.sh) continuously polls for changes in a loop of
> > pg_logical_slot_get_changes calls, using the second replication slot.
> > Since the test pg database is otherwise idle, only one of those calls
> > returns any rows.

If this is anything more than a reproducer: don't do that, use the
streaming protocol.

> > [snip]
> >
> > Any fix / workaround?
>
> I didn't try it, but it looks like this part of code in logicalfuncs.c
> is to blame:
>
> [snip]
>
> So we compute end_of_wal before entering the loop, but the WAL keeps
> growing as we read through it.

So?

> If we do it correctly, there's potential that the loop will never finish
> if the WAL grows faster than we can decode it.

> Shouldn't we also try to re-write this SRF to support
> SFRM_ValuePerCall?

Why? ValuePerCall gets materialized into a tuplestore as well, unless
you call it from the select list.

> Another (quite minor) problem I see is this:
>
> --- a/src/backend/replication/logical/logicalfuncs.c
> +++ b/src/backend/replication/logical/logicalfuncs.c
> @@ -301,7 +301,7 @@ pg_logical_slot_get_changes_guts(FunctionCallInfo fcinfo, bool confirm, bool bin
>         upto_lsn = PG_GETARG_LSN(1);
>
>         if (PG_ARGISNULL(2))
> -               upto_nchanges = InvalidXLogRecPtr;
> +               upto_nchanges = 0;
>         else
>                 upto_nchanges = PG_GETARG_INT32(2);

Ugh, yes.

Greetings,

Andres Freund
Re: BUG #13725: Logical Decoding - wrong results with large transactions and unfortunate timing
From
"Shulgin, Oleksandr"
Date:
On Mon, Oct 26, 2015 at 12:05 PM, Andres Freund <andres@anarazel.de> wrote:

> > So we compute end_of_wal before entering the loop, but the WAL keeps
> > growing as we read through it.
>
> So?
>
> > Shouldn't we also try to re-write this SRF to support
> > SFRM_ValuePerCall?
>
> Why? ValuePerCall gets materialized into a tuplestore as well, unless
> you call it from the select list.

Ah, yes: good catch. Then it makes less sense to change this indeed.

--
Alex
Re: BUG #13725: Logical Decoding - wrong results with large transactions and unfortunate timing
From
Andres Freund
Date:
On 2015-10-26 13:21:44 +0200, ofir.manor@gmail.com wrote:
> Yes, this is a small script to reproduce; the real code is Java, and we
> saw sporadic wrong results.
> However, I'm interested in CDC (getting change notifications per row to
> my app), not PG-to-PG replication.

The streaming interface exists for row-by-row notification as well, and
is a *LOT* more efficient.

If there's a bug here, we obviously need to fix it nonetheless.
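If you want to consume the stream from an application, a minimal libpq
consumer looks roughly like the sketch below. This is just an
illustration, not production code: error recovery, keepalive replies and
standby status feedback are all omitted, and without the feedback the
server will eventually drop the connection. pg_recvlogical does the same
thing properly.

/*
 * Minimal sketch of a logical streaming protocol consumer using libpq.
 * NOT production code: no error recovery, no standby status feedback.
 */
#include <stdio.h>
#include <libpq-fe.h>

int
main(void)
{
    /* "replication=database" asks for a logical replication connection */
    PGconn     *conn = PQconnectdb("dbname=postgres replication=database");
    PGresult   *res;
    char       *buf;
    int         len;

    if (PQstatus(conn) != CONNECTION_OK)
    {
        fprintf(stderr, "connection failed: %s", PQerrorMessage(conn));
        return 1;
    }

    /* start streaming; on success the server switches to COPY BOTH mode */
    res = PQexec(conn, "START_REPLICATION SLOT \"test2_slot\" LOGICAL 0/0");
    if (PQresultStatus(res) != PGRES_COPY_BOTH)
    {
        fprintf(stderr, "START_REPLICATION failed: %s", PQerrorMessage(conn));
        return 1;
    }
    PQclear(res);

    while ((len = PQgetCopyData(conn, &buf, 0)) > 0)
    {
        /*
         * XLogData messages ('w') carry one decoded change each, after a
         * 25-byte header (message type + start LSN + end LSN + timestamp).
         * Keepalives ('k') are silently ignored here, which is exactly
         * what you must not do in real code.
         */
        if (buf[0] == 'w')
        {
            fwrite(buf + 25, 1, len - 25, stdout);
            putchar('\n');
        }
        PQfreemem(buf);
    }

    PQfinish(conn);
    return 0;
}

Greetings,

Andres Freund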
Re: BUG #13725: Logical Decoding - wrong results with large transactions and unfortunate timing
From
"Shulgin, Oleksandr"
Date:
On Mon, Oct 26, 2015 at 12:30 PM, Andres Freund <andres@anarazel.de> wrote:

> The streaming interface exists for row-by-row notification as well, and
> is a *LOT* more efficient.

Yeah, but I don't think there's a workable Java implementation available
yet?

> If there's a bug here, we obviously need to fix it nonetheless.

I would assume re-calculating end_of_wal in the while loop condition
would fix this? One can obviously run out of memory with a more
WAL-intensive example, but I would not consider that a bug of this
interface.

--
Alex
Re: BUG #13725: Logical Decoding - wrong results with large transactions and unfortunate timing
From
Andres Freund
Date:
On 2015-10-26 12:34:48 +0100, Shulgin, Oleksandr wrote:
> Yeah, but I don't think there's a workable Java implementation available
> yet?

No idea, but it's not that hard to write one.

> > If there's a bug here, we obviously need to fix it nonetheless.
>
> I would assume re-calculating end_of_wal in the while loop condition
> would fix this?

Why? That'd just lead to outputting more rows in one invocation, and
that's it? I think I'm not following what you see as the problem?
Re: BUG #13725: Logical Decoding - wrong results with large transactions and unfortunate timing
From
"Shulgin, Oleksandr"
Date:
On Mon, Oct 26, 2015 at 12:38 PM, Andres Freund <andres@anarazel.de> wrote:

> > I would assume re-calculating end_of_wal in the while loop condition
> > would fix this?
>
> Why? That'd just lead to outputting more rows in one invocation, and
> that's it? I think I'm not following what you see as the problem?

I think there are just some false expectations involved about how this
interface should work. The OP likely expects that after the partial
results were returned by the first call to pg_logical_slot_get_changes(),
the next call will continue from the point where the first call left off.

This doesn't happen because in the first call we never cross a
transaction boundary? Hm, but why do we see the partial changes anyway?
I would assume that if we started decoding this at all, the transaction
was already committed and end_of_wal will be past its end...

I'm lost.

--
Alex
Re: BUG #13725: Logical Decoding - wrong results with large transactions and unfortunate timing
From
Andres Freund
Date:
Hi,

On 2015-10-26 14:09:24 +0200, ofir.manor@gmail.com wrote:
> Regarding my expectations...
> 1. I have observed that pg_logical_slot_get_changes always returns
> changes at transaction boundaries.

Right, it's documented to do so.

> For example, asking for 1 change (third parameter, replace NULL with a
> number) can still return millions of changes (if the next transaction
> is big), as the check is likely done at the end of each transaction.
> I'm actually OK with that (weird but doesn't hurt me).

The reason for that is that we can't return changes until they're
committed, and that we currently can't specify anything but transaction
boundaries as a "progress" point.

> 2. I have validated that a single call to pg_logical_slot_get_changes
> returns a result set with duplicates, going back to the start (I've
> seen it with a Java debugger, looping over the forward-only cursor of
> the SELECT from the replication slot). That is the bug I'm reporting -
> not across calls but within a single call.

That'd be rather weird. I'll look into it, but due to the pgconf.eu
conference this week I can't promise I'll get to it this week.

> 3. I also don't understand the "end of WAL" thing, since the call must
> wait to see if the transaction commits or rolls back.

The SQL interface doesn't ever block waiting for WAL; it decodes as far
as it can with the existing data.

> Maybe if it reaches "end of WAL", it starts over internally, adding the
> same changes again to some internal buffer that later gets sent?

No.
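To make (1) a bit more concrete, the decode loop has roughly this shape
(paraphrased from memory, so don't hold me to the details):

/*
 * Rough paraphrase of the loop in pg_logical_slot_get_changes_guts();
 * not an exact quote of the 9.4 source.
 */
while (/* reader position still below end_of_wal */)
{
    XLogRecord *record = XLogReadRecord(ctx->reader, startptr, &errm);

    if (record != NULL)
    {
        /*
         * Changes are accumulated in the reorder buffer until the COMMIT
         * record is decoded; at that point the whole transaction is
         * replayed through the output plugin in one go.
         */
        LogicalDecodingProcessRecord(ctx, record);
    }

    /*
     * Limits are only checked here, between WAL records, which is why
     * asking for one change can still return an arbitrarily large
     * transaction.
     */
    if (upto_nchanges != 0 && upto_nchanges <= p->returned_rows)
        break;
    if (upto_lsn != InvalidXLogRecPtr && upto_lsn <= ctx->reader->EndRecPtr)
        break;
}

Andres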
Re: BUG #13725: Logical Decoding - wrong results with large transactions and unfortunate timing
From
ofir.manor@gmail.com
Date:
> > 2. I have validated that a single call to pg_logical_slot_get_changes
> > returns a result set with duplicates, going back to the start (I've
> > seen it with a Java debugger, looping over the forward-only cursor of
> > the SELECT from the replication slot). That is the bug I'm reporting -
> > not across calls but within a single call.
>
> That'd be rather weird. I'll look into it, but due to the pgconf.eu
> conference this week I can't promise I'll get to it this week.
>
> Andres

Sure, triage can wait, make the most out of the conference, and thanks
for the fast replies!

I did update the test case to prove that all output comes from a single
call. I added a different constant to each SQL query - the bash loop
iteration number:

for i in `seq 1 10000`; do
  echo SELECT $i ",v.* FROM pg_logical_slot_get_changes('test2_slot', NULL, NULL) v;"
done | psql --quiet --tuples-only | cat --squeeze-blank > out2
cat out2 | wc -l

You can see that all the output comes from the same SQL statement (in
this case, statement 419):

419 | 1/76344D18 | 450880 | BEGIN 450880
419 | 1/76344D18 | 450880 | table public.test: INSERT: id[integer]:1 v[character varying]:'3'
419 | 1/76344DE0 | 450880 | table public.test: INSERT: id[integer]:2 v[character varying]:'6'
...
419 | 1/76BD9348 | 450880 | table public.test: INSERT: id[integer]:61439 v[character varying]:'184317'
419 | 1/76BD93D8 | 450880 | table public.test: INSERT: id[integer]:61440 v[character varying]:'184320'
419 | 1/76344D18 | 450880 | table public.test: INSERT: id[integer]:1 v[character varying]:'3'
419 | 1/76344DE0 | 450880 | table public.test: INSERT: id[integer]:2 v[character varying]:'6'
...
419 | 1/7713BEB0 | 450880 | table public.test: INSERT: id[integer]:99999 v[character varying]:'299997'
419 | 1/7713BF40 | 450880 | table public.test: INSERT: id[integer]:100000 v[character varying]:'300000'
419 | 1/7713C028 | 450880 | COMMIT 450880

--
Ofir Manor
Blog: http://ofirm.wordpress.com
LinkedIn: http://il.linkedin.com/in/ofirmanor
Twitter: @ofirm
Mobile: +972-54-7801286
Re: BUG #13725: Logical Decoding - wrong results with large transactions and unfortunate timing
From
ofir.manor@gmail.com
Date:
Regarding my expectations...

1. I have observed that pg_logical_slot_get_changes always returns
changes at transaction boundaries. For example, asking for 1 change
(third parameter, replace NULL with a number) can still return millions
of changes (if the next transaction is big), as the check is likely done
at the end of each transaction. I'm actually OK with that (weird but
doesn't hurt me).

2. I have validated that a single call to pg_logical_slot_get_changes
returns a result set with duplicates, going back to the start (I've
seen it with a Java debugger, looping over the forward-only cursor of
the SELECT from the replication slot). That is the bug I'm reporting -
not across calls but within a single call.

3. I also don't understand the "end of WAL" thing, since the call must
wait to see if the transaction commits or rolls back. Maybe if it
reaches "end of WAL", it starts over internally, adding the same changes
again to some internal buffer that later gets sent?

I can also add that it starts over after a "round" number of rows - I
saw 8K, 52K, 64K etc. (run the test case). Maybe that is a hint (or just
an artifact of how INSERT ... SELECT works?)

On Mon, Oct 26, 2015 at 1:51 PM, Shulgin, Oleksandr
<oleksandr.shulgin@zalando.de> wrote:

> I think there are just some false expectations involved about how this
> interface should work. The OP likely expects that after the partial
> results were returned by the first call to pg_logical_slot_get_changes(),
> the next call will continue from the point where the first call left off.
>
> [snip]
>
> I'm lost.
>
> --
> Alex

--
Ofir Manor
Blog: http://ofirm.wordpress.com
LinkedIn: http://il.linkedin.com/in/ofirmanor
Twitter: @ofirm
Mobile: +972-54-7801286
Re: BUG #13725: Logical Decoding - wrong results with large transactions and unfortunate timing
From
ofir.manor@gmail.com
Date:
Yes, this is a small script to reproduce; the real code is Java, and we
saw sporadic wrong results. However, I'm interested in CDC (getting
change notifications per row to my app), not PG-to-PG replication.

Hope it makes sense,
Ofir

On Mon, Oct 26, 2015 at 1:05 PM, Andres Freund <andres@anarazel.de> wrote:

> On 2015-10-26 10:43:08 +0100, Shulgin, Oleksandr wrote:
> > [snip]
>
> If this is anything more than a reproducer: don't do that, use the
> streaming protocol.
>
> [snip]
>
> Greetings,
>
> Andres Freund

--
Ofir Manor
Blog: http://ofirm.wordpress.com
LinkedIn: http://il.linkedin.com/in/ofirmanor
Twitter: @ofirm
Mobile: +972-54-7801286
Re: BUG #13725: Logical Decoding - wrong results with large transactions and unfortunate timing
From
ofir.manor@gmail.com
Date:
Hi Andres, any update?

To add another data point: I found that the "concurrent" run (get from
test2_slot) also runs much slower than the get that starts only after
the insert commits (get from test1_slot). You can amplify this by going
from an insert of 100K rows to 1M rows.

Ofir

On Mon, Oct 26, 2015 at 3:05 PM, <ofir.manor@gmail.com> wrote:

> Sure, triage can wait, make the most out of the conference, and thanks
> for the fast replies!
>
> I did update the test case to prove that all output comes from a single
> call. I added a different constant to each SQL query - the bash loop
> iteration number:
>
> [snip]

--
Ofir Manor
Blog: http://ofirm.wordpress.com
LinkedIn: http://il.linkedin.com/in/ofirmanor
Twitter: @ofirm
Mobile: +972-54-7801286