RE: Deadlock between logrep apply worker and tablesync worker - Mailing list pgsql-hackers
From | houzj.fnst@fujitsu.com
---|---
Subject | RE: Deadlock between logrep apply worker and tablesync worker
Date | 
Msg-id | OS0PR01MB57165EFD2571ED4085A8158494D69@OS0PR01MB5716.jpnprd01.prod.outlook.com
In response to | Re: Deadlock between logrep apply worker and tablesync worker (vignesh C <vignesh21@gmail.com>)
Responses | Re: Deadlock between logrep apply worker and tablesync worker
List | pgsql-hackers
On Tuesday, January 31, 2023 1:07 AM vignesh C <vignesh21@gmail.com> wrote:
> On Mon, 30 Jan 2023 at 17:30, vignesh C <vignesh21@gmail.com> wrote:
> > On Mon, 30 Jan 2023 at 13:00, houzj.fnst@fujitsu.com <houzj.fnst@fujitsu.com> wrote:
> > > On Monday, January 30, 2023 2:32 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > > > On Mon, Jan 30, 2023 at 9:20 AM vignesh C <vignesh21@gmail.com> wrote:
> > > > > On Sat, 28 Jan 2023 at 11:26, Amit Kapila <amit.kapila16@gmail.com> wrote:
> > > > > >
> > > > > > One thing that looks a bit odd is that we will anyway have a similar check in replorigin_drop_guts(), which is a static function called from only one place, so will it be required to check at both places?
> > > > >
> > > > > There is a possibility that the initial check to verify that the replication origin exists in replorigin_drop_by_name was successful, but later either the table sync worker or the apply worker process might have dropped the replication origin.
> > > >
> > > > Won't locking on the particular origin prevent concurrent drops? IIUC, the drop happens after the patch acquires the lock on the origin.
> > >
> > > Yes, I think the existence check in replorigin_drop_guts is unnecessary as we already lock the origin before that. I think the check in replorigin_drop_guts is a custom check after calling SearchSysCache1 to get the tuple, but the error should not happen as no concurrent drop can be performed.
> > >
> > > To make it simpler, one idea is to move the code that gets the tuple from the system cache into replorigin_drop_by_name(). After locking the origin, we can try to get the tuple and do the existence check, and we can reuse this tuple to perform the origin delete. In this approach we only need to check origin existence once after locking. BTW, if we do this, then we'd better rename replorigin_drop_guts() to something like replorigin_state_clear(), as the function would only clear the in-memory information after that.
> >
> > The attached updated patch has the changes to handle the same.
>
> I had not merged one of the local changes that was present, please find the updated patch including that change. Sorry for missing that change.

I also measured the runtime of "src/test/subscription/t/002_types.pl" before and after the patch (changing the lock level) and Tom's patch (splitting the transaction), like what Vignesh shared on -hackers. I ran each case about 100 times. Tom's patch and the lock-level patch behave similarly on my machine[1]:

HEAD: 3426 ~ 6425 ms
HEAD + Tom: 3404 ~ 3462 ms
HEAD + Vignesh: 3419 ~ 3474 ms
HEAD + Tom + Vignesh: 3408 ~ 3454 ms

Even apart from the reduction in test time, lowering the lock level and locking the specific object also reduce lock contention, which benefits users of the exposed function as well as the table sync worker and the apply worker. So I think pushing the patch to change the lock level makes sense, and the patch looks good to me.

While on it, after pushing the patch, I think there is another case that might also be worth improving: the table sync worker and the apply worker may try to drop the same origin, which can cause some delay. This is a different case from the deadlock, so I feel we can try to improve it in a separate patch.

[1] CentOS 8.2, 128G RAM, 40 processors Intel(R) Xeon(R) Silver 4210 CPU @ 2.20GHz

Best Regards,
Hou zj
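For readers following the thread, a minimal sketch of the refactoring idea quoted above: take a weaker lock on pg_replication_origin, lock the specific origin object, check the origin's existence once after locking, and reuse the syscache tuple for the catalog delete. This is an illustration rather than the committed change; replorigin_state_clear() is the renamed helper proposed in the discussion (hypothetical), and the exact lock levels shown are assumptions.

```c
/*
 * Illustrative sketch (assumptions, not the committed patch): drop a
 * replication origin while holding a lock on that specific origin,
 * checking its existence a single time and reusing the syscache tuple
 * for the catalog delete.
 */
#include "postgres.h"

#include "access/table.h"
#include "access/xact.h"
#include "catalog/indexing.h"
#include "catalog/pg_replication_origin.h"
#include "replication/origin.h"
#include "storage/lmgr.h"
#include "utils/syscache.h"

/* Hypothetical rename of replorigin_drop_guts(): in-memory cleanup only. */
extern void replorigin_state_clear(RepOriginId roident, bool nowait);

void
replorigin_drop_by_name(const char *name, bool missing_ok, bool nowait)
{
	RepOriginId roident;
	Relation	rel;
	HeapTuple	tuple;

	Assert(IsTransactionState());

	/*
	 * A row-level lock on pg_replication_origin is assumed to be enough
	 * here because concurrent drops of the same origin are serialized by
	 * the object lock taken below.
	 */
	rel = table_open(ReplicationOriginRelationId, RowExclusiveLock);

	roident = replorigin_by_name(name, missing_ok);
	if (!OidIsValid(roident))
	{
		/* Origin not found and missing_ok was true: nothing to do. */
		table_close(rel, RowExclusiveLock);
		return;
	}

	/* Lock the specific origin to prevent a concurrent drop. */
	LockSharedObject(ReplicationOriginRelationId, roident, 0,
					 AccessExclusiveLock);

	/* Single existence check, done only after the lock is acquired. */
	tuple = SearchSysCache1(REPLORIGIDENT, ObjectIdGetDatum(roident));
	if (!HeapTupleIsValid(tuple))
		elog(ERROR, "cache lookup failed for replication origin with ID %d",
			 roident);

	/* Clear the in-memory state (what replorigin_drop_guts used to do). */
	replorigin_state_clear(roident, nowait);

	/* Reuse the tuple fetched above to remove the catalog entry. */
	CatalogTupleDelete(rel, &tuple->t_self);
	ReleaseSysCache(tuple);

	CommandCounterIncrement();

	table_close(rel, RowExclusiveLock);
}
```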