Synchronizing slots from primary to standby - Mailing list pgsql-hackers
From | Petr Jelinek |
---|---|
Subject | Synchronizing slots from primary to standby |
Date | |
Msg-id | 3095349b-44d4-bf11-1b33-7eefb585d578@2ndquadrant.com |
Responses | Re: Synchronizing slots from primary to standby |
List | pgsql-hackers |
Hi,

As Andres mentioned over in the minimal decoding on standby thread [1], that functionality can be used to add a simple worker which periodically synchronizes the slot state from the primary to a standby.

The attached patch is a rough implementation of such a worker. It's nowhere near committable in its current state. It primarily serves two purposes: to give us something concrete on which we can agree on the approach (and, if we do, to serve as a base for it), and to demonstrate that the patch in [1] can indeed be used for this functionality. All this means that this patch depends on [1] to work.

The approach I chose is to change the logical replication launcher to also run on a standby and to support a new type of worker which is started on a standby for slot synchronization. The new worker (slotsync) is responsible for periodically fetching information from the primary server and moving slots on the standby forward based on that (using the fast forwarding functionality added in PG11). There is one worker per database (logical slots are per database, walrcv_exec needs a db connection, etc). I had to add a new replication command for listing slots so that the launcher can check which databases on the upstream actually have slots and start the slotsync worker only for those. The second patch in the series just adds the ability to filter which slots are actually synchronized.

This approach should eventually be portable to logical replication as well. The only difference there is that we need to be able to map LSNs of the publisher to LSNs of the subscriber. We already do that in apply, so that should be doable, but I don't have that as a goal for the first version of the feature.

The basic functionality seems to be working pretty well, however there are several discussion points and unfinished parts:

a) Do we want to automatically create and drop slots when they get created or dropped on the primary? Currently the patch does auto-create but does not auto-drop yet. There is no way to signal that a slot was dropped, so I don't see a straightforward way to differentiate between slots that have been dropped on the primary and those that only exist on the standby. I guess if we added the second feature with the slot list as well, we could drop anything on that list that's not on the primary...

b) The slot creation is somewhat interesting. The slot might be created while the standby does not have the WAL needed by existing slots on the primary, because those slots are behind the standby. We solve it by creating an ephemeral slot and waiting for the primary slot to pass its LSN before persisting it (similarly to when we are trying to build the initial snapshot). This seems reasonable to me, but the coding could use another pair of eyes there.

c) With the periodic start/stop of decoding on the slot (for the move), the logging of every start of a decoding context is pretty annoying/spammy. We should probably tone that down.

d) The launcher integration needs improvement: add a worker kind rather than guessing from the values of dbid, subid and relid, and make decisions based on that. Also, the interfaces for manipulating the workers should probably use LogicalRepWorkerId rather than the above-mentioned parameters and guessing everywhere.

e) We probably should support synchronizing physical slots as well (currently we only sync logical slots). That should be easy, provided we don't mind that "logical replication launcher" is somewhat of a misnomer then...

f) Maybe walreceiver or startup should signal these new workers once enough data has been processed, so it's not purely time based. But I think that kind of optimization can be left for later.

Also (these are pretty pointless until we agree that this is the right approach):

- there is no documentation update yet
- there are no TAP tests yet
- the recheck timer might need a GUC

[1] https://www.postgresql.org/message-id/20181212204154.nsxf3gzqv3gesl32@alap3.anarazel.de

--
Petr Jelinek                  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
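For readers following along, the creation rule in (b) and the list-based drop idea in (a) can be modelled roughly as below. This is only a toy sketch in Python, not code from the patch: the names parse_lsn, StandbySlot, sync_once and created_at_lsn are hypothetical, the use of ">=" for "the primary slot has passed the LSN" is an assumption, and the real worker would move slots with the server's own machinery rather than plain integers. The only PostgreSQL fact relied on is that an LSN is printed as two hexadecimal halves separated by a slash.

```python
from dataclasses import dataclass


def parse_lsn(text: str) -> int:
    """Convert PostgreSQL's textual 'hi/lo' hex LSN into a 64-bit integer."""
    hi, lo = text.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)


@dataclass
class StandbySlot:
    # Hypothetical stand-in for a slot that exists on the standby.
    name: str
    created_at_lsn: int       # standby position when the slot was created
    persistent: bool = False  # a freshly created slot starts out ephemeral


def sync_once(primary_slots, standby_slots, standby_lsn, sync_list=None):
    """One pass of a hypothetical slotsync loop.

    primary_slots: dict of slot name -> LSN of that slot on the primary
    standby_slots: dict of slot name -> StandbySlot (mutated in place)
    standby_lsn:   current position of the standby
    sync_list:     optional set of slot names the worker manages
    Returns the names of the slots dropped in this pass.
    """
    for name, primary_lsn in primary_slots.items():
        if sync_list is not None and name not in sync_list:
            continue  # filtered out, as with the second patch in the series
        slot = standby_slots.get(name)
        if slot is None:
            # Auto-create, but only as an ephemeral slot for now.
            slot = StandbySlot(name, created_at_lsn=standby_lsn)
            standby_slots[name] = slot
        # Persist only once the primary slot has caught up to the point
        # where the standby slot was created (assumed to mean ">=").
        if not slot.persistent and primary_lsn >= slot.created_at_lsn:
            slot.persistent = True

    dropped = []
    if sync_list is not None:
        # Idea from (a): anything on the list that the primary no longer
        # has is assumed to have been dropped there.
        for name in list(standby_slots):
            if name in sync_list and name not in primary_slots:
                del standby_slots[name]
                dropped.append(name)
    return dropped
```

Without a sync_list the sketch never drops anything, which mirrors the problem described in (a): there is then no way to tell a slot dropped on the primary from one that only ever existed on the standby.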