Thread: [HACKERS] logical replication - possible remaining problem

From
Erik Rijkers
Date:
I am not sure whether what I found here amounts to a bug; I might be
doing something dumb.

During the last few months I did tests by running pgbench over logical 
replication.  Earlier emails have details.

The basic form of that now works well (and the fix has been committed)
but as I looked over my testing program I noticed one change I made to
it, already many weeks ago:

In the cleanup during startup (pre-flight check you might say) and also 
before the end, instead of
  echo "delete from pg_subscription;" | psql -qXp $port2     -- (1)

I changed that (as I say, many weeks ago) to:
  echo "delete from pg_subscription;
        delete from pg_subscription_rel;
        delete from pg_replication_origin; " | psql -qXp $port2   -- (2)
 

This occurs (2x) inside the bash function clean_pubsub(), in main test 
script pgbench_detail2.sh

This change was an effort to ensure a 'clean' start (and end) state
that would always be the same.

All my more recent testing (and that of Mark, I have to assume) was thus 
done with (2).

Now, looking at the script again I am thinking that it would be
reasonable to expect that after issuing
   delete from pg_subscription;

the other 2 tables are /also/ cleaned, automatically, as a consequence.  
(Is this reasonable? this is really the main question of this email).

So I removed the latter two delete statements again, and ran the tests 
again with the form in  (1)

I have established that (after a number of successful cycles) the test
stops succeeding, and the replica log shows repetitions of:

2017-06-07 22:10:29.057 CEST [2421] LOG:  logical replication apply 
worker for subscription "sub1" has started
2017-06-07 22:10:29.057 CEST [2421] ERROR:  could not find free 
replication state slot for replication origin with OID 11
2017-06-07 22:10:29.057 CEST [2421] HINT:  Increase 
max_replication_slots and try again.
2017-06-07 22:10:29.058 CEST [2061] LOG:  worker process: logical 
replication worker for subscription 29235 (PID 2421) exited with exit 
code 1

when I manually 'clean up' by doing:
   delete from pg_replication_origin;

then, and only then, does the session finish and succeed ('replica ok').

So to me it looks as if there is an omission of
pg_replication_origin-cleanup when pg_subscription is deleted.

Does that make sense?  All this is probably vague and I am only posting
in the hope that Petr (or someone else) perhaps immediately understands
what goes wrong, even with this limited amount of info.

In the meantime I will try to dig up more detailed info...


thanks,


Erik Rijkers



Re: [HACKERS] logical replication - possible remaining problem

From
Alvaro Herrera
Date:
Erik Rijkers wrote:

> Now, looking at the script again I am thinking that it would be reasonable
> to expect that after issuing
>    delete from pg_subscription;
> 
> the other 2 tables are /also/ cleaned, automatically, as a consequence.  (Is
> this reasonable? this is really the main question of this email).

I don't think it's reasonable to expect that the system recovers
automatically from what amounts to catalog corruption.  You should be
using the DDL that removes subscriptions instead.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [HACKERS] logical replication - possible remaining problem

From
Erik Rijkers
Date:
On 2017-06-07 23:18, Alvaro Herrera wrote:
> Erik Rijkers wrote:
> 
>> Now, looking at the script again I am thinking that it would be 
>> reasonable
>> to expect that after issuing
>>    delete from pg_subscription;
>> 
>> the other 2 tables are /also/ cleaned, automatically, as a 
>> consequence.  (Is
>> this reasonable? this is really the main question of this email).
> 
> I don't think it's reasonable to expect that the system recovers
> automatically from what amounts to catalog corruption.  You should be
> using the DDL that removes subscriptions instead.

You're right, that makes sense.
Thanks.



Re: [HACKERS] logical replication - possible remaining problem

From
Petr Jelinek
Date:
Hi,

On 07/06/17 22:49, Erik Rijkers wrote:
> I am not sure whether what I found here amounts to a bug, I might be
> doing something dumb.
> 
> During the last few months I did tests by running pgbench over logical
> replication.  Earlier emails have details.
> 
> The basic form of that now works well (and the fix has been committed)
> but as I looked over my testing program I noticed one change I made to
> it, already many weeks ago:
> 
> In the cleanup during startup (pre-flight check you might say) and also
> before the end, instead of
> 
>   echo "delete from pg_subscription;" | psql -qXp $port2     -- (1)
> 
> I changed that (as I say, many weeks ago) to:
> 
>   echo "delete from pg_subscription;
>         delete from pg_subscription_rel;
>         delete from pg_replication_origin; " | psql -qXp $port2   -- (2)
> 
> This occurs (2x) inside the bash function clean_pubsub(), in main test
> script pgbench_detail2.sh
> 
> This change was an effort to ensure a 'clean' start (and end) state
> that would always be the same.
> 
> All my more recent testing (and that of Mark, I have to assume) was thus
> done with (2).
> 
> Now, looking at the script again I am thinking that it would be
> reasonable to expect that after issuing
>    delete from pg_subscription;
> 
> the other 2 tables are /also/ cleaned, automatically, as a consequence. 
> (Is this reasonable? this is really the main question of this email).
> 

Hmm, they are not cleaned automatically. Deleting from system catalogs
manually like this never propagates to related tables; we don't use FKs
there.
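
[Editorial sketch, not part of Petr's message.] Since nothing cascades, a manual delete from pg_subscription can leave origin rows behind. The snippet below is one possible way to list them from the test script. It assumes the apply worker's origin is named pg_<subscription OID>, which is how PostgreSQL 10 names it; the function names are hypothetical, and port2 is the replica port variable from the original script.

```shell
#!/usr/bin/env bash
# Hypothetical helper: emit a query that lists replication origins whose
# name looks like a subscription origin (pg_<oid>) but for which no row
# exists in pg_subscription any more.
orphan_origins_sql() {
  cat <<'SQL'
select o.roident, o.roname
from pg_replication_origin o
where o.roname ~ '^pg_[0-9]+$'
  and not exists (
        select 1
        from pg_subscription s
        where 'pg_' || s.oid::text = o.roname);
SQL
}

# Hypothetical wrapper: run the query on the replica.
# port2 must be set by the caller, as in the original test script.
list_orphan_origins() {
  orphan_origins_sql | psql -qtAXp "${port2:?port2 must be set}"
}
```

Calling list_orphan_origins after form (1) of the cleanup should show whether origins are accumulating from one test cycle to the next.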

> So I removed the latter two delete statements again, and ran the tests
> again with the form in  (1)
> 
> I have established that (after a number of successful cycles) the test
> stops succeeding, and the replica log shows repetitions of:
> 
> 2017-06-07 22:10:29.057 CEST [2421] LOG:  logical replication apply
> worker for subscription "sub1" has started
> 2017-06-07 22:10:29.057 CEST [2421] ERROR:  could not find free
> replication state slot for replication origin with OID 11
> 2017-06-07 22:10:29.057 CEST [2421] HINT:  Increase
> max_replication_slots and try again.
> 2017-06-07 22:10:29.058 CEST [2061] LOG:  worker process: logical
> replication worker for subscription 29235 (PID 2421) exited with exit
> code 1
> 
> when I manually 'clean up' by doing:
>    delete from pg_replication_origin;
> 

Yeah, because you consumed all the origins (I am still not a huge fan of
how that limit works, but that's a separate discussion).
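
[Editorial sketch, not part of Petr's message.] As a rough diagnostic for the error quoted above, the test script could compare the number of rows in pg_replication_origin with max_replication_slots before each cycle. The function names are hypothetical, port2 is the replica port variable from the original script, and the comparison is only an approximation of the server's internal bookkeeping.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed (exit status 0) when the number of origins
# in use has reached the configured limit.
#   $1 = number of rows in pg_replication_origin
#   $2 = value of max_replication_slots
origins_exhausted() {
  [ "$1" -ge "$2" ]
}

# Hypothetical wrapper: fetch both numbers from the replica and warn.
# port2 must be set by the caller, as in the original test script.
check_origins() {
  used=$(psql -qtAXp "${port2:?port2 must be set}" \
              -c "select count(*) from pg_replication_origin")
  max=$(psql -qtAXp "$port2" -c "show max_replication_slots")
  if origins_exhausted "$used" "$max"; then
    echo "warning: $used origins, max_replication_slots = $max" >&2
  fi
}
```

The arithmetic is kept in its own function so it can be checked without a running server.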

> then, and only then, does the session finish and succeed ('replica ok').
> 
> So to me it looks as if there is an omission of
> pg_replication_origin-cleanup when pg_subscription is deleted.
> 

There is no omission; the origin is not supposed to be deleted automatically
unless you use DROP SUBSCRIPTION.
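
[Editorial sketch, not part of Petr's message.] Following this advice and Álvaro's, the cleanup in clean_pubsub() could issue DDL instead of catalog deletes. The version below is a minimal sketch, not the actual script: it assumes the subscription is named sub1 (the name that appears in the log above), the helper name drop_sub_sql is hypothetical, and port2 is the replica port variable from the original script.

```shell
#!/usr/bin/env bash
# Hypothetical helper: build the DDL statement for one subscription name.
# The name is inserted as-is, so it must be a plain identifier.
drop_sub_sql() {
  printf 'DROP SUBSCRIPTION IF EXISTS %s;\n' "$1"
}

# Sketch of the cleanup step using DDL rather than
# "delete from pg_subscription". The subscription name sub1 is an
# assumption taken from the replica log.
# port2 must be set by the caller, as in the original test script.
clean_pubsub() {
  drop_sub_sql sub1 | psql -qXp "${port2:?port2 must be set}"
}
```

Note that DROP SUBSCRIPTION also tries to drop the associated replication slot on the publisher, so the publisher should be reachable when this runs.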


-- 
  Petr Jelinek                  http://www.2ndQuadrant.com/
  PostgreSQL Development, 24x7 Support, Training & Services