Re: [HACKERS] Interval for launching the table sync worker - Mailing list pgsql-hackers

From Masahiko Sawada
Subject Re: [HACKERS] Interval for launching the table sync worker
Date
Msg-id CAD21AoAC44RCid7gnXdYWrHHDYLUbnQSWxk8HDjaeROEVGqeKg@mail.gmail.com
Whole thread Raw
In response to Re: [HACKERS] Interval for launching the table sync worker  (Petr Jelinek <petr.jelinek@2ndquadrant.com>)
List pgsql-hackers
On Fri, Apr 14, 2017 at 9:14 PM, Petr Jelinek
<petr.jelinek@2ndquadrant.com> wrote:
> On 14/04/17 12:18, Masahiko Sawada wrote:
>> On Fri, Apr 14, 2017 at 7:09 AM, Petr Jelinek
>> <petr.jelinek@2ndquadrant.com> wrote:
>>> On 13/04/17 12:23, Masahiko Sawada wrote:
>>>> On Thu, Apr 13, 2017 at 11:53 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>>>>> On Wed, Apr 12, 2017 at 11:46 PM, Peter Eisentraut
>>>>> <peter.eisentraut@2ndquadrant.com> wrote:
>>>>>> On 4/12/17 00:48, Masahiko Sawada wrote:
>>>>>>> On Wed, Apr 12, 2017 at 1:28 PM, Peter Eisentraut
>>>>>>>> Perhaps instead of a global last_start_time, we store a per relation
>>>>>>>> last_start_time in SubscriptionRelState?
>>>>>>>
>>>>>>> I was thinking the same. But a problem is that the list of
>>>>>>> SubscriptionRelState is refreshed whenever the syncing table state
>>>>>>> becomes invalid (table_state_valid = false). I guess we need to
>>>>>>> improve these logic including GetSubscriptionNotReadyRelations().
>>>>>>
>>>>>> The table states are invalidated on a syscache callback from
>>>>>> pg_subscription_rel, which happens roughly speaking when a table
>>>>>> finishes the initial sync.  So if we're worried about failing tablesync
>>>>>> workers relaunching to quickly, this would only be a problem if a
>>>>>> tablesync of another table finishes right in that restart window.  That
>>>>>> doesn't seem a terrible issue to me.
>>>>>>
>>>>>
>>>>> I think the table states are invalidated whenever the table sync
>>>>> worker starts, because the table sync worker updates its status of
>>>>> pg_subscription_rel and commits it before starting actual copy. So we
>>>>> cannot rely on that. I thought we can store last_start_time into
>>>>> pg_subscription_rel but it might be overkill. I'm now thinking to
>>>>> change GetSubscriptionNotReadyRealtions so that last_start_time in
>>>>> SubscriptionRelState is taken over to new list.
>>>>>
>>>>
>>>> Attached the latest patch. It didn't actually necessary to change
>>>> GetSubscriptionNotReadyRelations. I just changed the logic refreshing
>>>> the sync table state list.
>>>> Please give me feedback.
>>>>
>>>
>>> Hmm this might work. Although I was actually wondering if we could store
>>> the last start timestamp in the worker shared memory and do some magic
>>> with that (ie, not clearing subid and relid and try to then do rate
>>> limiting based on subid+relid+timestamp stored in shmem). That would
>>> then work same way for the main apply workers as well. It would have the
>>> disadvantage that if some tables were consistently failing, no other
>>> tables could get synchronized as the amount of workers is limited.
>>
>> Hmm I guess that it's not a good design that a table sync worker and a
>> apply worker for a relation takes sole possession of a worker slot
>> until it successes. It would bother each other.
>>
>
> Not sure what you mean by apply worker for relation, I meant the main
> subscription apply worker, the "per relation apply worker" is same as
> tablesync worker.

Oops, I made a mistake. I meant just a apply worker.

>
> Thinking about the possible failure scenarios more, I guess you are
> right. Because if there is "global" event for the publisher/subscriber
> connection (ie network goes down or something) it will affect the main
> worker as well so it won't launch anything. If there is table specific
> issue it will affect only that table.
>
> The corner-cases with doing it per table where we launch tablesync
> needlessly are pretty much just a) connection limit on publisher
> reached, b) slot limit  on publisher reached, that's probably okay. We
> might be able to catch these two specifically and do some global
> limiting based on those (something like your original patch except the
> start-time global variable would only be set on specific error and
> stored in shmem), but that does not need to be solved immediately.
>
> --
>   Petr Jelinek                  http://www.2ndQuadrant.com/
>   PostgreSQL Development, 24x7 Support, Training & Services


Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center



pgsql-hackers by date:

Previous
From: Bruce Momjian
Date:
Subject: Re: [HACKERS] [pgsql-www] Small issue in online devel documentationbuild
Next
From: Bruce Momjian
Date:
Subject: Re: [HACKERS] Should pg_current_wal_location() becomepg_current_wal_lsn()