Re: [HACKERS] Interval for launching the table sync worker - Mailing list pgsql-hackers
| From | Masahiko Sawada |
|---|---|
| Subject | Re: [HACKERS] Interval for launching the table sync worker |
| Date | |
| Msg-id | CAD21AoAC44RCid7gnXdYWrHHDYLUbnQSWxk8HDjaeROEVGqeKg@mail.gmail.com |
| In response to | Re: [HACKERS] Interval for launching the table sync worker (Petr Jelinek <petr.jelinek@2ndquadrant.com>) |
| List | pgsql-hackers |
On Fri, Apr 14, 2017 at 9:14 PM, Petr Jelinek <petr.jelinek@2ndquadrant.com> wrote:
> On 14/04/17 12:18, Masahiko Sawada wrote:
>> On Fri, Apr 14, 2017 at 7:09 AM, Petr Jelinek
>> <petr.jelinek@2ndquadrant.com> wrote:
>>> On 13/04/17 12:23, Masahiko Sawada wrote:
>>>> On Thu, Apr 13, 2017 at 11:53 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>>>>> On Wed, Apr 12, 2017 at 11:46 PM, Peter Eisentraut
>>>>> <peter.eisentraut@2ndquadrant.com> wrote:
>>>>>> On 4/12/17 00:48, Masahiko Sawada wrote:
>>>>>>> On Wed, Apr 12, 2017 at 1:28 PM, Peter Eisentraut
>>>>>>>> Perhaps instead of a global last_start_time, we store a per-relation
>>>>>>>> last_start_time in SubscriptionRelState?
>>>>>>>
>>>>>>> I was thinking the same. But a problem is that the list of
>>>>>>> SubscriptionRelState is refreshed whenever the syncing table state
>>>>>>> becomes invalid (table_state_valid = false). I guess we need to
>>>>>>> improve this logic, including GetSubscriptionNotReadyRelations().
>>>>>>
>>>>>> The table states are invalidated on a syscache callback from
>>>>>> pg_subscription_rel, which happens, roughly speaking, when a table
>>>>>> finishes the initial sync. So if we're worried about failing tablesync
>>>>>> workers relaunching too quickly, this would only be a problem if a
>>>>>> tablesync of another table finishes right in that restart window. That
>>>>>> doesn't seem a terrible issue to me.
>>>>>
>>>>> I think the table states are invalidated whenever the table sync
>>>>> worker starts, because the table sync worker updates its status in
>>>>> pg_subscription_rel and commits it before starting the actual copy. So we
>>>>> cannot rely on that. I thought we could store last_start_time in
>>>>> pg_subscription_rel, but it might be overkill. I'm now thinking to
>>>>> change GetSubscriptionNotReadyRelations so that last_start_time in
>>>>> SubscriptionRelState is carried over to the new list.
>>>>
>>>> Attached the latest patch.
>>>> It wasn't actually necessary to change
>>>> GetSubscriptionNotReadyRelations. I just changed the logic for refreshing
>>>> the sync table state list.
>>>> Please give me feedback.
>>>
>>> Hmm, this might work. Although I was actually wondering if we could store
>>> the last start timestamp in the worker shared memory and do some magic
>>> with that (i.e., not clearing subid and relid and then doing rate
>>> limiting based on subid+relid+timestamp stored in shmem). That would
>>> then work the same way for the main apply workers as well. It would have the
>>> disadvantage that if some tables were consistently failing, no other
>>> tables could get synchronized, as the number of workers is limited.
>>
>> Hmm, I guess it's not a good design that a table sync worker and an
>> apply worker for a relation take sole possession of a worker slot
>> until they succeed. They would bother each other.
>
> Not sure what you mean by an apply worker for a relation; I meant the main
> subscription apply worker. The "per relation apply worker" is the same as the
> tablesync worker.

Oops, I made a mistake. I meant just an apply worker.

> Thinking about the possible failure scenarios more, I guess you are
> right. Because if there is a "global" event for the publisher/subscriber
> connection (i.e. the network goes down or something), it will affect the main
> worker as well, so it won't launch anything. If there is a table-specific
> issue, it will affect only that table.
>
> The corner cases with doing it per table, where we launch tablesync
> needlessly, are pretty much just a) the connection limit on the publisher
> being reached, and b) the slot limit on the publisher being reached; that's
> probably okay. We might be able to catch these two specifically and do some
> global limiting based on those (something like your original patch, except
> the start-time global variable would only be set on a specific error and
> stored in shmem), but that does not need to be solved immediately.
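To make the carry-over idea above concrete, here is a minimal standalone sketch in C. It is not PostgreSQL source: the type and field names (SubscriptionRelState, relid, last_start_time) are modeled after the thread, and plain arrays stand in for PostgreSQL's List/palloc machinery. The point is only that rebuilding the not-ready relation list after an invalidation preserves each relation's last start time, so rate limiting survives a refresh.

```c
/* Simplified model (NOT actual PostgreSQL code) of carrying over a
 * per-relation last_start_time when the list of not-yet-ready
 * subscription relations is rebuilt after a syscache invalidation. */
#include <assert.h>
#include <stddef.h>

typedef unsigned int Oid;
typedef long long TimestampTz;

typedef struct SubscriptionRelState
{
    Oid         relid;
    char        state;            /* e.g. 'i' = init, 'r' = ready */
    TimestampTz last_start_time;  /* 0 = worker never started */
} SubscriptionRelState;

/* Rebuild the table-state list from a fresh catalog snapshot (newlist),
 * preserving last_start_time from the previous list (oldlist) so that a
 * refresh does not reset the relaunch rate limiting. */
static void
carry_over_start_times(SubscriptionRelState *newlist, size_t new_len,
                       const SubscriptionRelState *oldlist, size_t old_len)
{
    for (size_t i = 0; i < new_len; i++)
    {
        newlist[i].last_start_time = 0;
        for (size_t j = 0; j < old_len; j++)
        {
            if (oldlist[j].relid == newlist[i].relid)
            {
                newlist[i].last_start_time = oldlist[j].last_start_time;
                break;
            }
        }
    }
}
```

A relation that disappears from the catalog snapshot simply drops out of the new list, and a newly added relation starts with last_start_time = 0, i.e. eligible to launch immediately.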
> --
> Petr Jelinek  http://www.2ndQuadrant.com/
> PostgreSQL Development, 24x7 Support, Training & Services

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
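The rate limiting being debated in the thread reduces to a simple gate: only (re)launch a tablesync worker for a relation if enough time has passed since its last start. A minimal standalone sketch of that check, again an illustration rather than PostgreSQL source (the 5000 ms figure mirrors the default of the wal_retrieve_retry_interval GUC, which the launcher-side patches in this discussion reuse):

```c
/* Standalone sketch of the per-relation launch gate. Timestamps are in
 * microseconds, as with PostgreSQL's TimestampTz; the interval GUC is
 * in milliseconds. */
#include <assert.h>

typedef long long TimestampTz;   /* microseconds */

static const long wal_retrieve_retry_interval = 5000;  /* ms */

/* Return 1 if a tablesync worker for this relation may be (re)launched. */
static int
tablesync_launch_allowed(TimestampTz now, TimestampTz last_start_time)
{
    if (last_start_time == 0)
        return 1;                /* never started: launch immediately */
    return (now - last_start_time) >=
           (TimestampTz) wal_retrieve_retry_interval * 1000;
}
```

Whether last_start_time lives in the per-relation SubscriptionRelState list or in worker shared memory keyed by subid+relid is exactly the design question the thread leaves open; the check itself is the same either way.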