Thread: [HACKERS] logical replication deranged sender
After dropping a subscription, it says it succeeded and that it dropped the slot on the publisher.
But the publisher still has the slot, and a full-tilt process described by ps as
postgres: wal sender process jjanes [local] idle in transaction
Strace shows that this process is doing nothing but opening, reading, lseek, and closing from pg_wal, and calling sbrk. It never sends anything.
This is not how it should work, correct?
I'm running pgbench -c4 -j8 on the publisher, publishing all pgbench tables.
It doesn't seem to happen every time I create and then drop a subscription, but it happens pretty often.
The subscribing server's logs shows:
21270 2017-05-07 16:04:16.517 PDT LOG: process 21270 still waiting for AccessShareLock on relation 6100 of database 0 after 1000.051 ms
21270 2017-05-07 16:04:16.517 PDT DETAIL: Process holding the lock: 25493. Wait queue: 21270.
21270 2017-05-07 16:04:16.520 PDT LOG: process 21270 acquired AccessShareLock on relation 6100 of database 0 after 1003.608 ms
25493 DROP SUBSCRIPTION 2017-05-07 16:04:16.520 PDT LOG: duration: 1005.176 ms CPU: user: 0.00 s, system: 0.00 s, elapsed: 1.00 s statement: drop subscription foobar ;
The publishing server's logs doesn't show anything relevant.
I'm using 86713de, I have not tried this in earlier commits.
Cheers,
Jeff
On 08/05/17 01:17, Jeff Janes wrote: > After dropping a subscription, it says it succeeded and that it dropped > the slot on the publisher. > > But the publisher still has the slot, and a full-tilt process described > by ps as > > postgres: wal sender process jjanes [local] idle in transaction > > Strace shows that this process is doing nothing but opening, reading, > lseek, and closing from pg_wal, and calling sbrk. It never sends anything. > > This is not how it should work, correct? > No, and I don't see how this happens though, we only report success if the publisher side said that DROP_REPLICATION_SLOT succeeded. So far I don't see anything in source that would explain this. I will need to reproduce it first to see what's happening (wasn't able to do that yet, but it might just need more time since you say it does no happen always). > > The subscribing server's logs shows: > > 21270 2017-05-07 16:04:16.517 PDT LOG: process 21270 still waiting for > AccessShareLock on relation 6100 of database 0 after 1000.051 ms > 21270 2017-05-07 16:04:16.517 PDT DETAIL: Process holding the lock: > 25493. Wait queue: 21270. > 21270 2017-05-07 16:04:16.520 PDT LOG: process 21270 acquired > AccessShareLock on relation 6100 of database 0 after 1003.608 ms > 25493 DROP SUBSCRIPTION 2017-05-07 16:04:16.520 PDT LOG: duration: > 1005.176 ms CPU: user: 0.00 s, system: 0.00 s, elapsed: 1.00 s > statement: drop subscription foobar ; > Yeah that's normal because DropSubscription holds exclusive lock on pg_subscription. I am guessing 21270 is the logical replication launcher? -- Petr Jelinek http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On 08/05/17 13:47, Petr Jelinek wrote: > On 08/05/17 01:17, Jeff Janes wrote: >> After dropping a subscription, it says it succeeded and that it dropped >> the slot on the publisher. >> >> But the publisher still has the slot, and a full-tilt process described >> by ps as >> >> postgres: wal sender process jjanes [local] idle in transaction >> >> Strace shows that this process is doing nothing but opening, reading, >> lseek, and closing from pg_wal, and calling sbrk. It never sends anything. >> >> This is not how it should work, correct? >> > > No, and I don't see how this happens though, we only report success if > the publisher side said that DROP_REPLICATION_SLOT succeeded. So far I > don't see anything in source that would explain this. I will need to > reproduce it first to see what's happening (wasn't able to do that yet, > but it might just need more time since you say it does no happen always). > Hm I wonder are there any workers left on subscriber when this happens? -- Petr Jelinek http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On Tue, May 9, 2017 at 9:18 AM, Petr Jelinek <petr.jelinek@2ndquadrant.com> wrote:
On 08/05/17 13:47, Petr Jelinek wrote:
> On 08/05/17 01:17, Jeff Janes wrote:
>> After dropping a subscription, it says it succeeded and that it dropped
>> the slot on the publisher.
>>
>> But the publisher still has the slot, and a full-tilt process described
>> by ps as
>>
>> postgres: wal sender process jjanes [local] idle in transaction
>>
>> Strace shows that this process is doing nothing but opening, reading,
>> lseek, and closing from pg_wal, and calling sbrk. It never sends anything.
>>
>> This is not how it should work, correct?
>>
>
> No, and I don't see how this happens though, we only report success if
> the publisher side said that DROP_REPLICATION_SLOT succeeded. So far I
> don't see anything in source that would explain this. I will need to
> reproduce it first to see what's happening (wasn't able to do that yet,
> but it might just need more time since you say it does no happen always).
>
Hm I wonder are there any workers left on subscriber when this happens?
Yes. using ps, I get this:
postgres: bgworker: logical replication worker for subscription 16408 sync 16391
postgres: bgworker: logical replication worker for subscription 16408 sync 16388
They seem to be permanently blocked on a socket to read from the publisher.
On the publisher side, I think it is very slowly assembling a snapshot. It seems to be adding one xid at a time, and then re-sorting the entire list. Over and over.
Cheers,
Jeff
On 09/05/17 19:13, Jeff Janes wrote: > On Tue, May 9, 2017 at 9:18 AM, Petr Jelinek > <petr.jelinek@2ndquadrant.com <mailto:petr.jelinek@2ndquadrant.com>> wrote: > > On 08/05/17 13:47, Petr Jelinek wrote: > > On 08/05/17 01:17, Jeff Janes wrote: > >> After dropping a subscription, it says it succeeded and that it dropped > >> the slot on the publisher. > >> > >> But the publisher still has the slot, and a full-tilt process described > >> by ps as > >> > >> postgres: wal sender process jjanes [local] idle in transaction > >> > >> Strace shows that this process is doing nothing but opening, reading, > >> lseek, and closing from pg_wal, and calling sbrk. It never sends anything. > >> > >> This is not how it should work, correct? > >> > > > > No, and I don't see how this happens though, we only report success if > > the publisher side said that DROP_REPLICATION_SLOT succeeded. So far I > > don't see anything in source that would explain this. I will need to > > reproduce it first to see what's happening (wasn't able to do that yet, > > but it might just need more time since you say it does no happen always). > > > > Hm I wonder are there any workers left on subscriber when this happens? > > > Yes. using ps, I get this: > > postgres: bgworker: logical replication worker for subscription 16408 > sync 16391 > postgres: bgworker: logical replication worker for subscription 16408 > sync 16388 > > They seem to be permanently blocked on a socket to read from the publisher. > > On the publisher side, I think it is very slowly assembling a snapshot. > It seems to be adding one xid at a time, and then re-sorting the entire > list. Over and over. > Okay, then it's the same issue Masahiko Sawada reported in nearby thread, or at least has same cause. -- Petr Jelinek http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On Wed, May 10, 2017 at 3:06 AM, Petr Jelinek <petr.jelinek@2ndquadrant.com> wrote: > Okay, then it's the same issue Masahiko Sawada reported in nearby > thread, or at least has same cause. This report is here: https://www.postgresql.org/message-id/CAD21AoBYpyqTSw+=ES+xXtRGMPKh=pKiqjNxZKnNUae0pSt9bg@mail.gmail.com It may be useful to keep all discussions on the other thread as that's the first report. -- Michael