Thread: [HACKERS] logical replication deranged sender

[HACKERS] logical replication deranged sender

From
Jeff Janes
Date:
After dropping a subscription, it says it succeeded and that it dropped the slot on the publisher.

But the publisher still has the slot, and a full-tilt process described by ps as 

postgres: wal sender process jjanes [local] idle in transaction

Strace shows that this process is doing nothing but opening, reading, lseek, and closing from pg_wal, and calling sbrk.  It never sends anything.

This is not how it should work, correct?

I'm running pgbench -c4 -j8 on the publisher, publishing all pgbench tables.

It doesn't seem to happen every time I create and then drop a subscription, but it happens pretty often.

The subscribing server's logs shows:

21270  2017-05-07 16:04:16.517 PDT LOG:  process 21270 still waiting for AccessShareLock on relation 6100 of database 0 after 1000.051 ms
21270  2017-05-07 16:04:16.517 PDT DETAIL:  Process holding the lock: 25493. Wait queue: 21270.
21270  2017-05-07 16:04:16.520 PDT LOG:  process 21270 acquired AccessShareLock on relation 6100 of database 0 after 1003.608 ms
25493 DROP SUBSCRIPTION 2017-05-07 16:04:16.520 PDT LOG:  duration: 1005.176 ms CPU: user: 0.00 s, system: 0.00 s, elapsed: 1.00 s   statement: drop subscription foobar ;

The publishing server's logs doesn't show anything relevant.

I'm using 86713de,  I have not tried this in earlier commits.

Cheers,

Jeff

Re: [HACKERS] logical replication deranged sender

From
Petr Jelinek
Date:
On 08/05/17 01:17, Jeff Janes wrote:
> After dropping a subscription, it says it succeeded and that it dropped
> the slot on the publisher.
> 
> But the publisher still has the slot, and a full-tilt process described
> by ps as 
> 
> postgres: wal sender process jjanes [local] idle in transaction
> 
> Strace shows that this process is doing nothing but opening, reading,
> lseek, and closing from pg_wal, and calling sbrk.  It never sends anything.
> 
> This is not how it should work, correct?
> 

No, and I don't see how this happens though, we only report success if
the publisher side said that DROP_REPLICATION_SLOT succeeded. So far I
don't see anything in source that would explain this. I will need to
reproduce it first to see what's happening (wasn't able to do that yet,
but it might just need more time since you say it does no happen always).

> 
> The subscribing server's logs shows:
> 
> 21270  2017-05-07 16:04:16.517 PDT LOG:  process 21270 still waiting for
> AccessShareLock on relation 6100 of database 0 after 1000.051 ms
> 21270  2017-05-07 16:04:16.517 PDT DETAIL:  Process holding the lock:
> 25493. Wait queue: 21270.
> 21270  2017-05-07 16:04:16.520 PDT LOG:  process 21270 acquired
> AccessShareLock on relation 6100 of database 0 after 1003.608 ms
> 25493 DROP SUBSCRIPTION 2017-05-07 16:04:16.520 PDT LOG:  duration:
> 1005.176 ms CPU: user: 0.00 s, system: 0.00 s, elapsed: 1.00 s  
> statement: drop subscription foobar ;
> 

Yeah that's normal because DropSubscription holds exclusive lock on
pg_subscription. I am guessing 21270 is the logical replication launcher?

--  Petr Jelinek                  http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training &
Services



Re: [HACKERS] logical replication deranged sender

From
Petr Jelinek
Date:
On 08/05/17 13:47, Petr Jelinek wrote:
> On 08/05/17 01:17, Jeff Janes wrote:
>> After dropping a subscription, it says it succeeded and that it dropped
>> the slot on the publisher.
>>
>> But the publisher still has the slot, and a full-tilt process described
>> by ps as 
>>
>> postgres: wal sender process jjanes [local] idle in transaction
>>
>> Strace shows that this process is doing nothing but opening, reading,
>> lseek, and closing from pg_wal, and calling sbrk.  It never sends anything.
>>
>> This is not how it should work, correct?
>>
> 
> No, and I don't see how this happens though, we only report success if
> the publisher side said that DROP_REPLICATION_SLOT succeeded. So far I
> don't see anything in source that would explain this. I will need to
> reproduce it first to see what's happening (wasn't able to do that yet,
> but it might just need more time since you say it does no happen always).
> 

Hm I wonder are there any workers left on subscriber when this happens?

--  Petr Jelinek                  http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training &
Services



Re: [HACKERS] logical replication deranged sender

From
Jeff Janes
Date:
On Tue, May 9, 2017 at 9:18 AM, Petr Jelinek <petr.jelinek@2ndquadrant.com> wrote:
On 08/05/17 13:47, Petr Jelinek wrote:
> On 08/05/17 01:17, Jeff Janes wrote:
>> After dropping a subscription, it says it succeeded and that it dropped
>> the slot on the publisher.
>>
>> But the publisher still has the slot, and a full-tilt process described
>> by ps as
>>
>> postgres: wal sender process jjanes [local] idle in transaction
>>
>> Strace shows that this process is doing nothing but opening, reading,
>> lseek, and closing from pg_wal, and calling sbrk.  It never sends anything.
>>
>> This is not how it should work, correct?
>>
>
> No, and I don't see how this happens though, we only report success if
> the publisher side said that DROP_REPLICATION_SLOT succeeded. So far I
> don't see anything in source that would explain this. I will need to
> reproduce it first to see what's happening (wasn't able to do that yet,
> but it might just need more time since you say it does no happen always).
>

Hm I wonder are there any workers left on subscriber when this happens?

Yes.  using ps, I get this:

postgres: bgworker: logical replication worker for subscription 16408 sync 16391
postgres: bgworker: logical replication worker for subscription 16408 sync 16388

They seem to be permanently blocked on a socket to read from the publisher.

On the publisher side, I think it is very slowly assembling a snapshot.  It seems to be adding one xid at a time, and then re-sorting the entire list.  Over and over.

Cheers,

Jeff

Re: [HACKERS] logical replication deranged sender

From
Petr Jelinek
Date:
On 09/05/17 19:13, Jeff Janes wrote:
> On Tue, May 9, 2017 at 9:18 AM, Petr Jelinek
> <petr.jelinek@2ndquadrant.com <mailto:petr.jelinek@2ndquadrant.com>> wrote:
> 
>     On 08/05/17 13:47, Petr Jelinek wrote:
>     > On 08/05/17 01:17, Jeff Janes wrote:
>     >> After dropping a subscription, it says it succeeded and that it dropped
>     >> the slot on the publisher.
>     >>
>     >> But the publisher still has the slot, and a full-tilt process described
>     >> by ps as
>     >>
>     >> postgres: wal sender process jjanes [local] idle in transaction
>     >>
>     >> Strace shows that this process is doing nothing but opening, reading,
>     >> lseek, and closing from pg_wal, and calling sbrk.  It never sends anything.
>     >>
>     >> This is not how it should work, correct?
>     >>
>     >
>     > No, and I don't see how this happens though, we only report success if
>     > the publisher side said that DROP_REPLICATION_SLOT succeeded. So far I
>     > don't see anything in source that would explain this. I will need to
>     > reproduce it first to see what's happening (wasn't able to do that yet,
>     > but it might just need more time since you say it does no happen always).
>     >
> 
>     Hm I wonder are there any workers left on subscriber when this happens?
> 
> 
> Yes.  using ps, I get this:
> 
> postgres: bgworker: logical replication worker for subscription 16408
> sync 16391
> postgres: bgworker: logical replication worker for subscription 16408
> sync 16388
> 
> They seem to be permanently blocked on a socket to read from the publisher.
> 
> On the publisher side, I think it is very slowly assembling a snapshot. 
> It seems to be adding one xid at a time, and then re-sorting the entire
> list.  Over and over.
> 

Okay, then it's the same issue Masahiko Sawada reported in nearby
thread, or at least has same cause.

--  Petr Jelinek                  http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training &
Services



Re: [HACKERS] logical replication deranged sender

From
Michael Paquier
Date:
On Wed, May 10, 2017 at 3:06 AM, Petr Jelinek
<petr.jelinek@2ndquadrant.com> wrote:
> Okay, then it's the same issue Masahiko Sawada reported in nearby
> thread, or at least has same cause.

This report is here:
https://www.postgresql.org/message-id/CAD21AoBYpyqTSw+=ES+xXtRGMPKh=pKiqjNxZKnNUae0pSt9bg@mail.gmail.com
It may be useful to keep all discussions on the other thread as that's
the first report.
-- 
Michael