RE: Synchronizing slots from primary to standby - Mailing list pgsql-hackers
From | Hayato Kuroda (Fujitsu) |
---|---|
Subject | RE: Synchronizing slots from primary to standby |
Date | |
Msg-id | TYAPR01MB5866A953535944E6FE6A86AEF5C5A@TYAPR01MB5866.jpnprd01.prod.outlook.com Whole thread Raw |
In response to | Re: Synchronizing slots from primary to standby (shveta malik <shveta.malik@gmail.com>) |
Responses |
RE: Synchronizing slots from primary to standby
Re: Synchronizing slots from primary to standby Re: Synchronizing slots from primary to standby |
List | pgsql-hackers |
Dear Shveta, Thank you for updating the patch! I found another ERROR due to the slot removal. Is this a real issue? 1. applied add_sleep.txt, which emulated the case the tablesync worker stucked and the primary crashed during the initial sync. 2. executed test_0925_v2.sh (You attached in [1]) 3. secondary could not start the logical replication because the slot was not created (log files were also attached). Here is my analysis. The cause is that the slotsync worker aborts the slot creation on secondary server because the restart_lsn of secondary ahead the primary's one. IIUC it can be occurred when tablesync workers finishes initial copy before walsenders stream changes. In this case, the relstate of the worker is set to SUBREL_STATE_CATCHUP and the apply worker waits till the relation becomes SUBREL_STATE_SYNCDONE. From here the slot on primary will not be updated until the relation is caught up. If some changes are come and the primary crashes at that time, the syncslot worker will abort the slot creation. Anyway, followings are my comments. I have not checked detailed conventions yet. It should be done in later stage. ~~~~~~~~~~~~~~~~ For 0001: === WalSndWaitForStandbyConfirmation() ``` + /* If postmaster asked us to stop, don't wait anymore */ + if (got_STOPPING) + break; ``` I have considered again, and it may still have an issue: logical walsenders may break from the loop before physical walsenders send WALs. This may be occurred because both physical and logical walsenders would get PROCSIG_WALSND_INIT_STOPPING. I think a function like WalSndWaitStopping() must be needed, which waits until physical walsenders become WALSNDSTATE_STOPPING or exit. Thought? WalSndWaitForStandbyConfirmation() ``` + standby_slot_cpy = list_copy(standby_slot_names_list); ``` I found that standby_slot_names_list and standby_slot_cpy would not be updated even if the GUC was updated. Is this acceptable? Won't it be occurred after you refactor the patch? What would be occurred when synchronize_slot_names is updated on secondary while primary executes this? WalSndWaitForStandbyConfirmation() ``` + + goto retry; ``` I checked other "goto retry;", but I could not find the pattern that the return clause does not exist after the goto (exception: void function). I also think that current style seems a bit strange. How about using an outer loop like While (list_length(standby_slot_cpy))? ===== slot.h ``` +extern void WaitForStandbyLSN(XLogRecPtr wait_for_lsn); ``` WaitForStandbyLSN() does not exist. ~~~~~~~~~~~~~~~~ For 0002: ===== General The patch requires that primary_conninfo must contain the dbname, but it conflicts with documentation. It says: ``` ...Do not specify a database name in the primary_conninfo string. ``` I confirmed [^a] it is harmless that primary_conninfo has dbname, but at least the description must be fixed. General I found that primary server output huge amount of logs when the log_min_duration_messages = 0. This ie because slotsync worker sends an SQL per 10ms, in wait_for_primary_slot_catchup(). Is there any good way to suppress it? Or, should we be patient? ===== ``` +{ oid => '6312', descr => 'what caused the replication slot to become invalid', ``` How did you determine the oid? IIRC, developping features should use oids in the range 8000-9999. See src/include/catalog/unused_oids. ===== LogicalRepCtxStruct ``` /* Background workers. */ + SlotSyncWorker *ss_workers; /* slot-sync workers */ LogicalRepWorker workers[FLEXIBLE_ARRAY_MEMBER]; ``` It's OK for now, but can we combine them into an array? IIUC there is no possibility to exist both of processes and they have same component, so it may be able to be same. It can reduce an attribute but may lead some difficulties to read. WaitForReplicationWorkerAttach() and logicalrep_worker_stop_internal() I could not find cases that has "LWLock *" as an argument (exception: functions in lwlock.c). Is it sufficient to check RecoveryInProgress() instead of specifying as arguments? ===== wait_for_primary_slot_catchup() ``` + /* Check if this standby is promoted while we are waiting */ + if (!RecoveryInProgress()) + { + /* + * The remote slot didn't pass the locally reserved position at + * the time of local promotion, so it's not safe to use. + */ + ereport( + WARNING, + (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE), + errmsg( + "slot-sync wait for slot %s interrupted by promotion, " + "slot creation aborted", remote_slot->name))); + pfree(cmd.data); + return false; + } ``` The part would not be executed if the promote signal is sent after the primary server crashes. I think walrcv_exec() will detect the failure first. The function must be wrapped by PG_TRY() and the message must be emitted in PG_CATCH(). There may be other approaches. wait_for_primary_slot_catchup() ``` + rc = WaitLatch(MyLatch, + WL_LATCH_SET | WL_TIMEOUT | WL_POSTMASTER_DEATH, + WORKER_DEFAULT_NAPTIME_MS, + WAIT_EVENT_REPL_SLOTSYNC_MAIN); ``` New wait event can be added. [1]: https://www.postgresql.org/message-id/CAJpy0uDD%2B9aJnDx9fBfvLvxJtxA7qqoAys4fo6h1tq1b_0_A7Q%40mail.gmail.com [^a] Regarding the secondary side, the libpqrcv_connect() does not do special things even if the primary_conninfo has dbname="XXX". It adds parameters like "replication=true" and sends a startup packet. As for the primary side, the startup packet is consumed in ProcessStartupPacket(). It checks whether the process should be a walsender or not (line 2204). Then (line 2290) the port->database_name[0] is set as '\0' in case of walsender. The value is used for setting the process title in BackendInitialize(). Also, InitPostgres() really sets some global variables like MyDatabaseId, but it is not occurred when the process is walsender. Best Regards, Hayato Kuroda FUJITSU LIMITED
Attachment
pgsql-hackers by date: