RE: Excessive number of replication slots for 12->14 logical replication - Mailing list pgsql-bugs
From: houzj.fnst@fujitsu.com
Subject: RE: Excessive number of replication slots for 12->14 logical replication
Date:
Msg-id: OS0PR01MB5716ED883D44E2214F3563B094959@OS0PR01MB5716.jpnprd01.prod.outlook.com
In response to: Re: Excessive number of replication slots for 12->14 logical replication (Ajin Cherian <itsajin@gmail.com>)
Responses: Re: Excessive number of replication slots for 12->14 logical replication
List: pgsql-bugs
On Sunday, July 24, 2022 4:17 PM Ajin Cherian <itsajin@gmail.com> wrote:
> On Sun, Jul 24, 2022 at 6:12 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Mon, Jul 18, 2022 at 3:13 PM hubert depesz lubaczewski
> > <depesz@depesz.com> wrote:
> > >
> > > On Mon, Jul 18, 2022 at 09:07:35AM +0530, Amit Kapila wrote:
> > >
> > > First error:
> > > #v+
> > > 2022-07-18 09:22:07.046 UTC,,,4145917,,62d5263f.3f42fd,2,,2022-07-18
> > > 09:22:07 UTC,28/21641,1219146,ERROR,53400,"could not find free
> > > replication state slot for replication origin with OID
> > > 51",,"Increase max_replication_slots and try
> > > again.",,,,,,,"","logical replication worker",,0
> > > #v-
> > >
> > > Nothing else errored out before; no warnings, no fatals.
> > >
> > > From the first ERROR, I was getting them in the range of 40-70 per minute.
> > >
> > > At the same time I was logging data from `select now(), * from
> > > pg_replication_slots` every 2 seconds.
> > >
> > ...
> > >
> > > So, it looks like there are up to 10 focal slots, all active, and then
> > > there are sync slots with weirdly high counts for inactive ones.
> > >
> > > At most, I had 11 active sync slots.
> > >
> > > Looks like some kind of timing issue, which would be in line with
> > > what Kyotaro Horiguchi wrote initially.
> > >
> >
> > I think this is a timing issue similar to what Horiguchi-san has
> > pointed out, but due to replication origins. We drop the replication
> > origin after the sync worker that used it has finished. This is done
> > by the apply worker because we don't allow the origin to be dropped
> > while the process owning it is alive. I am not sure of the
> > repercussions, but maybe we can allow the process that owns the
> > origin to drop it.
> >
>
> I have written a patch which does the dropping of replication origins
> in the sync worker itself.
> I had to reset the origin session (which also resets the owned-by
> flag) prior to dropping the slots.

Thanks for the patch. I tried it and confirmed that we no longer get the
ERROR "could not find free replication state slot for replication origin
with OID" after applying it.

I tested the patch by making the apply worker wait a bit longer after
setting the state to SUBREL_STATE_CATCHUP. Before the patch, the table
sync worker would then exit before the apply worker dropped the
replorigin, and the apply worker would hit the ERROR when it tried to
start another worker. (Rough sketches of the sync-worker-side origin
drop and of this delay-injection test follow at the end of this mail.)

A few comments:

1)
- * There is a chance that the user is concurrently performing
- * refresh for the subscription where we remove the table
- * state and its origin and by this time the origin might be
- * already removed. So passing missing_ok = true.
- */

I think it would be better if we could move these comments to the new
place where we drop the replorigin.

2)
- replorigin_drop_by_name(originname, true, false);

/*
 * Update the state to READY only after the origin cleanup.

Do we need to slightly modify this comment, since the origin drop code
has been moved elsewhere? Maybe: "It's safe to update the state to READY
as the origin should have been dropped by the table sync worker."

Best regards,
Hou zj
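
To make the discussed approach concrete, here is a minimal sketch (not
Ajin's actual patch) of a tablesync worker dropping its own origin: the
session is reset first, then the origin is dropped by name.
replorigin_session_reset() and replorigin_drop_by_name() are the real
backend functions quoted in this thread; the helper name and the
"pg_%u_%u" origin-name construction are illustrative assumptions.

#include "postgres.h"

#include "replication/origin.h"

/*
 * Illustrative sketch only: let the tablesync worker drop its own
 * replication origin before it exits, instead of leaving the drop to
 * the apply worker.
 */
static void
tablesync_drop_own_origin(Oid suboid, Oid relid)
{
	char		originname[NAMEDATALEN];

	/* Tablesync origins are assumed to be named "pg_<suboid>_<relid>". */
	snprintf(originname, sizeof(originname), "pg_%u_%u", suboid, relid);

	/*
	 * Release the session-level origin first: an origin cannot be dropped
	 * while a process still owns it, and resetting the session also clears
	 * the owned-by flag, as noted above.
	 */
	replorigin_session_reset();

	/*
	 * missing_ok = true because a concurrent ALTER SUBSCRIPTION ... REFRESH
	 * may already have removed the origin.
	 */
	replorigin_drop_by_name(originname, true, false);
}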
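
And a fragment sketch of the delay-injection test described above,
assuming it is spliced into the apply worker at the point where it moves
a relation to SUBREL_STATE_CATCHUP; the spinlock/relstate lines
approximate that hand-off, and only the pg_usleep() call is the injected
part, for testing only.

/*
 * Fragment sketch: widen the race window in the apply worker.  After the
 * relation is moved to SUBREL_STATE_CATCHUP, sleep so the tablesync
 * worker reaches SYNCDONE and exits before the (pre-patch) origin drop
 * in the apply worker runs.
 */
SpinLockAcquire(&syncworker->relmutex);
syncworker->relstate = SUBREL_STATE_CATCHUP;
syncworker->relstate_lsn = current_lsn;
SpinLockRelease(&syncworker->relmutex);

pg_usleep(5000000L);		/* injected 5-second delay, testing only */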