subscription disable_on_error not working after ALTER SUBSCRIPTION set bad conninfo - Mailing list pgsql-hackers
From | Peter Smith |
---|---|
Subject | subscription disable_on_error not working after ALTER SUBSCRIPTION set bad conninfo |
Date | |
Msg-id | CAHut+PuEsekA3e7ThwzWr+Us4x=LzkF7DSrED1UsZTUqNrhCUQ@mail.gmail.com Whole thread Raw |
Responses |
Re: subscription disable_on_error not working after ALTER SUBSCRIPTION set bad conninfo
|
List | pgsql-hackers |
Hi, I had reported a possible subscription 'disable_on_error' bug found while reviewing another patch. I am including my initial report and Nisha's analysis again here so that this topic has its own thread. ================== INITIAL REPORT [1] ================== ... I see now that any ALTER of the subscription's connection, even to some value that fails, will restart a new worker (like ALTER of any other subscription parameters). For a bad connection, it will continue to relaunch-worker/ERROR over and over. e.g. ---------- test_sub=# \r2024-01-17 09:34:28.665 AEDT [11274] LOG: logical replication apply worker for subscription "sub4" has started 2024-01-17 09:34:28.666 AEDT [11274] ERROR: could not connect to the publisher: invalid port number: "-1" 2024-01-17 09:34:28.667 AEDT [928] LOG: background worker "logical replication apply worker" (PID 11274) exited with exit code 1 2024-01-17 09:34:33.669 AEDT [11391] LOG: logical replication apply worker for subscription "sub4" has started 2024-01-17 09:34:33.669 AEDT [11391] ERROR: could not connect to the publisher: invalid port number: "-1" 2024-01-17 09:34:33.670 AEDT [928] LOG: background worker "logical replication apply worker" (PID 11391) exited with exit code 1 etc... ---------- While experimenting with the bad connection ALTER I also tried setting 'disable_on_error' like below: ALTER SUBSCRIPTION sub4 SET (disable_on_error); ALTER SUBSCRIPTION sub4 CONNECTION 'port = -1'; ...but here the subscription did not become DISABLED as I expected it would do on the next connection error iteration. It remains enabled and just continues to loop relaunch/ERROR indefinitely same as before. That looks like it may be a bug. Thoughts? ===================== ANALYSIS BY NISHA [2] ===================== Ideally, if the already running apply worker in "LogicalRepApplyLoop()" has any exception/error it will be handled and the subscription will be disabled if 'disable_on_error' is set - start_apply(XLogRecPtr origin_startpos) { PG_TRY(); { LogicalRepApplyLoop(origin_startpos); } PG_CATCH(); { if (MySubscription->disableonerr) DisableSubscriptionAndExit(); ... What is happening in this case is that the control reaches the function - run_apply_worker() -> start_apply() -> LogicalRepApplyLoop -> maybe_reread_subscription() ... /* * Exit if any parameter that affects the remote connection was changed. * The launcher will start a new worker but note that the parallel apply * worker won't restart if the streaming option's value is changed from * 'parallel' to any other value or the server decides not to stream the * in-progress transaction. */ if (strcmp(newsub->conninfo, MySubscription->conninfo) != 0 || ... and it sees a change in the parameter and calls apply_worker_exit(). This will exit the current process, without throwing an exception to the caller and the postmaster will try to restart the apply worker. The new apply worker, before reaching the start_apply() [where we handle exception], will hit the code to establish the connection to the publisher - ApplyWorkerMain() -> run_apply_worker() - ... LogRepWorkerWalRcvConn = walrcv_connect(MySubscription->conninfo, true /* replication */ , true, must_use_password, MySubscription->name, &err); if (LogRepWorkerWalRcvConn == NULL) ereport(ERROR, (errcode(ERRCODE_CONNECTION_FAILURE), errmsg("could not connect to the publisher: %s", err))); ... and due to the bad connection string in the subscription, it will error out. [28680] ERROR: could not connect to the publisher: invalid port number: "-1" [3196] LOG: background worker "logical replication apply worker" (PID 28680) exited with exit code 1 Now, the postmaster keeps trying to restart the apply worker and it will keep failing until the connection string is corrected or the subscription is disabled manually. I think this is a bug that needs to be handled in run_apply_worker() when disable_on_error is set. ====== [1] https://www.postgresql.org/message-id/CAHut%2BPvcf7P2dHbCeWPM4jQ%3DyHqf4WFS_C6eVb8V%3DbcZPMMp7A%40mail.gmail.com [2] https://www.postgresql.org/message-id/CABdArM6ORu%2BKpS_kXd-jwwPBqYPo1YqZjwwGnqmVanWgdHCggA%40mail.gmail.com Kind Regards, Peter Smith. Fujitsu Australia
pgsql-hackers by date: