Thread: [BUG] Logical replica crash if there was an error in a function.
Hello! Here is a fix for the bug first described in: https://www.postgresql.org/message-id/flat/adf0452f-8c6b-7def-d35e-ab516c80088e%40inbox.ru Reproduction: 1) On master with 'wal_level = logical' execute mascmd.sql attached. 2) On replica substitute the correct port in repcmd.sql and execute it. 3) On master execute command: INSERT INTO rul_rule_set VALUES ('1', 'name','1','age','true'); Replica will crash with: FailedAssertion("ActivePortal && ActivePortal->status == PORTAL_ACTIVE", File: "pg_proc.c", Line: 1038, PID: 42894) in infinite loop. After applying this patch replica will give the correct error message instead of assertion: 2022-08-21 17:08:39.935 MSK [143171] ERROR: relation "rul_rule_set" does not exist at character 172 2022-08-21 17:08:39.935 MSK [143171] QUERY: -- Last modified: 2022-08-21 17:08:39.930842+03 with parameters as ( <<--- skipped by me --- >>> ) 2022-08-21 17:08:39.935 MSK [143171] CONTEXT: SQL statement "create or replace function public.rule_set_selector( <<--- skipped by me --- >>> SQL statement "call public.rebuild_rule_set_selector()" PL/pgSQL function public.rul_rule_set_trg() line 4 at CALL processing remote data for replication origin "pg_16401" during "INSERT" for replication target relation "public.rul_rule_set" in transaction 741 finished at 0/17BE180 With best regards, -- Anton A. Melnikov Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Attachment
Hello! On 21.08.2022 17:33, Anton A. Melnikov wrote: > Hello! > > Here is a fix for the bug first described in: > https://www.postgresql.org/message-id/flat/adf0452f-8c6b-7def-d35e-ab516c80088e%40inbox.ru > Sorry, there was a wrong patch in the first letter. Here is a right version. With best regards, -- Anton A. Melnikov Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Attachment
Hello! The patch was rebased on current master. And here is a simplified crash reproduction: 1) On primary with 'wal_level = logical' execute: CREATE TABLE public.test (id int NOT NULL, val integer); CREATE PUBLICATION test_pub FOR TABLE test; 2) On replica replace XXXX in the repcmd.sql attached with primary port and execute it: psql -f repcmd.sql 3) On master execute command: INSERT INTO test VALUES ('1'); With best regards, -- Anton A. Melnikov Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Attachment
Hello! Added a TAP test for this case. On 30.08.2022 10:09, Anton A. Melnikov wrote: > Hello! > > The patch was rebased on current master. > And here is a simplified crash reproduction: > 1) On primary with 'wal_level = logical' execute: > CREATE TABLE public.test (id int NOT NULL, val integer); > CREATE PUBLICATION test_pub FOR TABLE test; > > 2) On replica replace XXXX in the repcmd.sql attached with primary port and execute it: > psql -f repcmd.sql > > 3) On master execute command: > INSERT INTO test VALUES ('1'); > With best regards, -- Anton A. Melnikov Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Attachment
"Anton A. Melnikov" <aamelnikov@inbox.ru> writes: > [ v4-0001-Fix-logical-replica-assert-on-func-error.patch ] I took a quick look at this. I think you're solving the problem in the wrong place. The real issue is why are we not setting up ActivePortal correctly when running user-defined code in a logrep worker? There is other code that expects that to be set, eg EnsurePortalSnapshotExists. regards, tom lane
Hello! Thanks for reply! On 24.09.2022 20:27, Tom Lane wrote: > I think you're solving the > problem in the wrong place. The real issue is why are > we not setting up ActivePortal correctly when running > user-defined code in a logrep worker? During a common query from the backend ActivePortal becomes defined in the PortalRun and then AfterTriggerEndQuery starts with non-NULL ActivePortal after ExecutorFinish. When the logrep worker is applying messages there are neither PortalStart nor PortalRun calls. And AfterTriggerEndQuery starts with undefined ActivePortal after finish-edata(). May be it's normal behavior? > There is other code > that expects that to be set, eg EnsurePortalSnapshotExists. When the logrep worker is applying message it doesn't have to use ActivePortal in EnsurePortalSnapshotExists because ActiveSnapshot is already installed. It is set at the beginning of each transaction in the begin_replication_step call. On the other hand, function_parse_error_transpose() tries to get the original query text (INSERT INTO test VALUES ('1') in our case) from the ActivePortal to clarify the location of the error. But in the logrep worker there is no way to restore original query text from the logrep message. There is only 'zipped' query equivalent to the original. So any function_parse_error_transpose() call seems to be useless in the logrep worker. And it looks like we can simply omit match_prosrc_to_query() call there. The attached patch does it. Best wishes, -- Anton A. Melnikov Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Attachment
On 2022-Sep-24, Tom Lane wrote: > "Anton A. Melnikov" <aamelnikov@inbox.ru> writes: > > [ v4-0001-Fix-logical-replica-assert-on-func-error.patch ] > > I took a quick look at this. I think you're solving the > problem in the wrong place. The real issue is why are > we not setting up ActivePortal correctly when running > user-defined code in a logrep worker? There is other code > that expects that to be set, eg EnsurePortalSnapshotExists. Right ... mostly, the logical replication *does* attempt to set up a transaction and active snapshot when replaying actions (c.f. begin_replication_step()). Is this firing at an inappropriate time, perhaps? -- Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/
On Sun, Oct 09, 2022 at 12:24:23PM +0300, Anton A. Melnikov wrote: > On the other hand, function_parse_error_transpose() tries to get > the original query text (INSERT INTO test VALUES ('1') in our case) from > the ActivePortal to clarify the location of the error. > But in the logrep worker there is no way to restore original query text > from the logrep message. There is only 'zipped' query equivalent to the original. > So any function_parse_error_transpose() call seems to be useless > in the logrep worker. Yeah, the query string is not available in this context, but it surely looks wrong to me to assume that something as low-level as function_parse_error_transpose() needs to be updated for the sake of a logical worker, while we have other areas that would expect a portal to be set up. -- Michael
Attachment
Michael Paquier <michael@paquier.xyz> writes: > Yeah, the query string is not available in this context, but it surely > looks wrong to me to assume that something as low-level as > function_parse_error_transpose() needs to be updated for the sake of a > logical worker, while we have other areas that would expect a portal > to be set up. After thinking about this some more, I'm withdrawing my opposition to fixing it by making function_parse_error_transpose() cope with not having an active portal. I have a few reasons: * A Portal is intended to contain an executor state. While worker.c does fake up an EState, there's certainly no plan tree or planstate tree, and I doubt it'd be sane to create dummy ones. So even if we made a Portal, it'd be lacking a lot of the stuff one would expect to find there. I fear that moving the cause of this sort of problem from "there's no ActivePortal" to "there's an ActivePortal but it lacks field X" wouldn't be an improvement. * There is actually just one other consumer of ActivePortal, namely EnsurePortalSnapshotExists, and that doesn't offer a lot of support for the idea that ActivePortal must always be set. It says * Nothing to do if a snapshot is set. (We take it on faith that the * outermost active snapshot belongs to some Portal; or if there is no * Portal, it's somebody else's responsibility to manage things.) and "it's somebody else's responsibility" summarizes the situation here pretty perfectly. worker.c *does* set up an active snapshot. * The comment in function_parse_error_transpose() freely admits that looking at the ActivePortal is a hack. It works, more or less, for the intended case of reporting a function-body syntax error nicely during CREATE FUNCTION. But it's capable of getting false-positive matches, so maybe someday we should replace it with something more bulletproof. * There isn't any strong reason why function_parse_error_transpose() has to succeed at finding the original query text. Its fallback approach of treating the syntax error position as internal to the function body text is fine, in fact it's just what we want here. So I'm now good with the idea of just not failing. I don't like the patch as presented though. First, the cfbot is quite rightly complaining about the "uninitialized variable" warning it draws. Second, I don't see a good reason to tie the change to logical replication in any way. Let's just change the Assert to an if(), as attached. regards, tom lane diff --git a/src/backend/catalog/pg_proc.c b/src/backend/catalog/pg_proc.c index a9fe45e347..e03b98bcd2 100644 --- a/src/backend/catalog/pg_proc.c +++ b/src/backend/catalog/pg_proc.c @@ -1017,7 +1017,6 @@ function_parse_error_transpose(const char *prosrc) { int origerrposition; int newerrposition; - const char *queryText; /* * Nothing to do unless we are dealing with a syntax error that has a @@ -1035,11 +1034,22 @@ function_parse_error_transpose(const char *prosrc) } /* We can get the original query text from the active portal (hack...) */ - Assert(ActivePortal && ActivePortal->status == PORTAL_ACTIVE); - queryText = ActivePortal->sourceText; + if (ActivePortal && ActivePortal->status == PORTAL_ACTIVE) + { + const char *queryText = ActivePortal->sourceText; - /* Try to locate the prosrc in the original text */ - newerrposition = match_prosrc_to_query(prosrc, queryText, origerrposition); + /* Try to locate the prosrc in the original text */ + newerrposition = match_prosrc_to_query(prosrc, queryText, + origerrposition); + } + else + { + /* + * Quietly give up if no ActivePortal. This is an unusual situation + * but it can happen in, e.g., logical replication workers. + */ + newerrposition = -1; + } if (newerrposition > 0) {
On Wed, Nov 2, 2022 at 11:32 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > > So I'm now good with the idea of just not failing. I don't like > the patch as presented though. First, the cfbot is quite rightly > complaining about the "uninitialized variable" warning it draws. > Second, I don't see a good reason to tie the change to logical > replication in any way. Let's just change the Assert to an if(), > as attached. > LGTM. I don't know if it is a good idea to omit the test case for this scenario. If required, we can reuse the test case from Sawada-San's patch in the email [1]. [1] - https://www.postgresql.org/message-id/CAD21AoDKA%2BMB4M9BOnct_%3DZj5bNHbkYn6oKZ2aOQp8m%3D3x2GhQ%40mail.gmail.com -- With Regards, Amit Kapila.
Amit Kapila <amit.kapila16@gmail.com> writes: > LGTM. I don't know if it is a good idea to omit the test case for this > scenario. If required, we can reuse the test case from Sawada-San's > patch in the email [1]. I don't think the cost of that test case is justified by the tiny probability that it'd ever catch anything. If we were just adding a query or two to an existing scenario, that could be okay; but spinning up and syncing a whole new primary and standby database is *expensive* when you multiply it by the number of times developers and buildfarm animals are going to run the tests in the future. There's also the little issue that I'm not sure it would actually detect a problem if we had one. The case is going to fail, and what we want to know is just how messily it fails, and I think the TAP infrastructure isn't very sensitive to that ... especially if the test isn't even checking for specific error messages. regards, tom lane
Hello! On 02.11.2022 21:02, Tom Lane wrote: > So I'm now good with the idea of just not failing. I don't like > the patch as presented though. First, the cfbot is quite rightly > complaining about the "uninitialized variable" warning it draws. > Second, I don't see a good reason to tie the change to logical > replication in any way. Let's just change the Assert to an if(), > as attached. Fully agreed that is the most optimal solution for that case. Thanks! Surely it's very rare one but there was a real segfault at production server. Someone came up with the idea to modify function like public.test_selector() in repcmd.sql (see above) on the fly with adding to it :last_modification: field from current time and some other parameters with the help of replace() inside the creation of the rebuild_test() procedure. On 03.11.2022 18:29, Tom Lane wrote: > Amit Kapila <amit.kapila16@gmail.com> writes: >> LGTM. I don't know if it is a good idea to omit the test case for this >> scenario. If required, we can reuse the test case from Sawada-San's >> patch in the email [1]. > > I don't think the cost of that test case is justified by the tiny > probability that it'd ever catch anything. If we were just adding a > query or two to an existing scenario, that could be okay; but spinning > up and syncing a whole new primary and standby database is *expensive* > when you multiply it by the number of times developers and buildfarm > animals are going to run the tests in the future. > > There's also the little issue that I'm not sure it would actually > detect a problem if we had one. The case is going to fail, and > what we want to know is just how messily it fails, and I think the > TAP infrastructure isn't very sensitive to that ... especially > if the test isn't even checking for specific error messages. Seems it is possible to do a test without these remarks. The attached test uses existing nodes and checks the specific error message. Additionally i've tried to reduce overall number of nodes previously used in this test in a similar way. Would be glad for comments and remarks. With best wishes, -- Anton A. Melnikov Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Attachment
"Anton A. Melnikov" <aamelnikov@inbox.ru> writes: > On 02.11.2022 21:02, Tom Lane wrote: >> I don't think the cost of that test case is justified by the tiny >> probability that it'd ever catch anything. > Seems it is possible to do a test without these remarks. The attached > test uses existing nodes and checks the specific error message. My opinion remains unchanged: this would be a very poor use of test cycles. > Additionally > i've tried to reduce overall number of nodes previously > used in this test in a similar way. Optimizing existing tests isn't an answer to that. We could install those optimizations without adding a new test case. regards, tom lane
Thanks a lot for the fast reply! On 03.11.2022 18:29, Tom Lane wrote: > If we were just adding a > query or two to an existing scenario, that could be okay; but spinning > up and syncing a whole new primary and standby database is *expensive* > when you multiply it by the number of times developers and buildfarm > animals are going to run the tests in the future. > On 15.11.2022 04:59, Tom Lane wrote: > "Anton A. Melnikov" <aamelnikov@inbox.ru> writes: >> On 02.11.2022 21:02, Tom Lane wrote: >>> I don't think the cost of that test case is justified by the tiny >>> probability that it'd ever catch anything. > >> Seems it is possible to do a test without these remarks. The attached >> test uses existing nodes and checks the specific error message. > > My opinion remains unchanged: this would be a very poor use of test > cycles. Sorry, i didn't fully understand what is required and added some functions to the test that spend extra cpu time. But i found that it is possible to make a test according to previous remarks by adding only a few extra queries to an existent test without any additional syncing. Experimentally, i could not observe any significant difference in CPU usage due to this test addition. The difference in the CPU usage was less than standard error. See 100_bugs-CPU-time.txt attached. >> Additionally >> i've tried to reduce overall number of nodes previously >> used in this test in a similar way. > > Optimizing existing tests isn't an answer to that. We could > install those optimizations without adding a new test case. Oh sure! Here is a separate patch for this optimization: https://www.postgresql.org/message-id/eb7aa992-c2d7-6ce7-4942-0c784231a362%40inbox.ru On the contrary with previous case in that one the difference in the CPU usage during 100_bugs.pl is essential; it decreases approximately by 30%. With the best wishes! -- Anton A. Melnikov Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Attachment
Hi, On 2022-11-16 17:52:50 +0300, Anton A. Melnikov wrote: > Sorry, i didn't fully understand what is required and > added some functions to the test that spend extra cpu time. But i found > that it is possible to make a test according to previous remarks by adding > only a few extra queries to an existent test without any additional syncing. > > Experimentally, i could not observe any significant difference in > CPU usage due to this test addition. > The difference in the CPU usage was less than standard error. > See 100_bugs-CPU-time.txt attached. > > > > Additionally > > > i've tried to reduce overall number of nodes previously > > > used in this test in a similar way. > > > > Optimizing existing tests isn't an answer to that. We could > > install those optimizations without adding a new test case. > > Oh sure! Here is a separate patch for this optimization: > https://www.postgresql.org/message-id/eb7aa992-c2d7-6ce7-4942-0c784231a362%40inbox.ru > On the contrary with previous case in that one the difference in the CPU usage > during 100_bugs.pl is essential; it decreases approximately by 30%. This CF entry causes tests to fail on all platforms: https://cirrus-ci.com/build/5755408111894528 E.g. https://api.cirrus-ci.com/v1/artifact/task/5298457144459264/testrun/build/testrun/subscription/100_bugs/log/regress_log_100_bugs #### Begin standard error psql:<stdin>:1: NOTICE: dropped replication slot "sub1" on publisher #### End standard error timed out waiting for match: ERROR: relation "error_name" does not exist at character at /tmp/cirrus-ci-build/src/test/subscription/t/100_bugs.plline 115. Greetings, Andres Freund
On 07.12.2022 21:03, Andres Freund wrote: > > This CF entry causes tests to fail on all platforms: > https://cirrus-ci.com/build/5755408111894528 > > E.g. > https://api.cirrus-ci.com/v1/artifact/task/5298457144459264/testrun/build/testrun/subscription/100_bugs/log/regress_log_100_bugs > > #### Begin standard error > psql:<stdin>:1: NOTICE: dropped replication slot "sub1" on publisher > #### End standard error > timed out waiting for match: ERROR: relation "error_name" does not exist at character at /tmp/cirrus-ci-build/src/test/subscription/t/100_bugs.plline 115. > > Greetings, > > Andres Freund Thank you for reminding! There was a conflict when applying v2 on current master. Rebased v3 is attached. Best wishes! -- Anton A. Melnikov Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Attachment
On Sun, 11 Dec 2022 at 09:21, Anton A. Melnikov <aamelnikov@inbox.ru> wrote: > > > On 07.12.2022 21:03, Andres Freund wrote: > > > > > This CF entry causes tests to fail on all platforms: > > https://cirrus-ci.com/build/5755408111894528 > > > > E.g. > > https://api.cirrus-ci.com/v1/artifact/task/5298457144459264/testrun/build/testrun/subscription/100_bugs/log/regress_log_100_bugs > > > > #### Begin standard error > > psql:<stdin>:1: NOTICE: dropped replication slot "sub1" on publisher > > #### End standard error > > timed out waiting for match: ERROR: relation "error_name" does not exist at character at /tmp/cirrus-ci-build/src/test/subscription/t/100_bugs.plline 115. > > > > Greetings, > > > > Andres Freund > > Thank you for reminding! > > There was a conflict when applying v2 on current master. > Rebased v3 is attached. Few suggestions: 1) There is a warning: +# This would crash on the subscriber if not fixed +$node_publisher->safe_psql('postgres', "INSERT INTO tab1 VALUES (3, 4)"); + +my $result = $node_subscriber->wait_for_log( + "ERROR: relation \"error_name\" does not exist at character" +); "my" variable $result masks earlier declaration in same scope at t/100_bugs.pl line 400. You can change: my $result = $node_subscriber->safe_psql('postgres', "SELECT * FROM sch1.t1"); to $result = $node_subscriber->safe_psql('postgres', "SELECT * FROM sch1.t1"); 2) Now that the crash is fixed, you could change it to a better message: +# This would crash on the subscriber if not fixed +$node_publisher->safe_psql('postgres', "INSERT INTO tab1 VALUES (3, 4)"); + +my $result = $node_subscriber->wait_for_log( + "ERROR: relation \"error_name\" does not exist at character" +); Regards, Vignesh
Thanks for your remarks. On 07.01.2023 15:27, vignesh C wrote: > > Few suggestions: > 1) There is a warning: > +# This would crash on the subscriber if not fixed > +$node_publisher->safe_psql('postgres', "INSERT INTO tab1 VALUES (3, 4)"); > + > +my $result = $node_subscriber->wait_for_log( > + "ERROR: relation \"error_name\" does not exist at character" > +); > > "my" variable $result masks earlier declaration in same scope at > t/100_bugs.pl line 400. > > You can change: > my $result = $node_subscriber->safe_psql('postgres', "SELECT * FROM sch1.t1"); > to > $result = $node_subscriber->safe_psql('postgres', "SELECT * FROM sch1.t1"); The reason is that the patch fell behind the master. Fixed in v4 together with rebasing on current master. > 2) Now that the crash is fixed, you could change it to a better message: > +# This would crash on the subscriber if not fixed > +$node_publisher->safe_psql('postgres', "INSERT INTO tab1 VALUES (3, 4)"); > + > +my $result = $node_subscriber->wait_for_log( > + "ERROR: relation \"error_name\" does not exist at character" > +); > Tried to make this comment more clear. Best wishes for the new year! -- Anton A. Melnikov Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Attachment
Hello! On 15.03.2023 21:29, Gregory Stark (as CFM) wrote: > These patches that are "Needs Review" and have received no comments at > all since before March 1st are these. If your patch is amongst this > list I would suggest any of: > > 1) Move it yourself to the next CF (or withdraw it) > 2) Post to the list with any pending questions asking for specific > feedback -- it's much more likely to get feedback than just a generic > "here's a patch plz review"... > 3) Mark it Ready for Committer and possibly post explaining the > resolution to any earlier questions to make it easier for a committer > to understand the state > There were some remarks: 1) very poor use of test cycles (by Tom Lane) Solved in v2 by adding few extra queries to an existent test without any additional syncing. 2) the patch-tester fails on all platforms (by Andres Freund) Fixed in v3. 3) warning with "my" variable $result and suggestion to correct comment (by vignesh C) Both fixed in v4. Now there are no any pending questions, so moved it to RFC. With the best regards! -- Anton A. Melnikov Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Attachment
"Anton A. Melnikov" <aamelnikov@inbox.ru> writes: > Now there are no any pending questions, so moved it to RFC. I did not think this case was worth memorializing in a test before, and I still do not. I'm inclined to reject this patch. regards, tom lane
Hello! On 03.04.2023 21:49, Tom Lane wrote: > "Anton A. Melnikov" <aamelnikov@inbox.ru> writes: >> Now there are no any pending questions, so moved it to RFC. > > I did not think this case was worth memorializing in a test before, > and I still do not. I'm inclined to reject this patch. Earlier, when discussing this test, there was a suggestion like this: > If we were just adding a > query or two to an existing scenario, that could be okay; The current version of the test seems to be satisfies this condition. The queries added do not affect the total test time within the measurement error. This case is rare, of cause, but it really took place in practice. So either there are some more reasons why this test should not be accepted that i do not understand, or i misunderstood something from the above. Could you help me to figure out, please. Would be very grateful. Sincerely yours, -- Anton A. Melnikov Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
"Anton A. Melnikov" <aamelnikov@inbox.ru> writes: > On 03.04.2023 21:49, Tom Lane wrote: >> I did not think this case was worth memorializing in a test before, >> and I still do not. I'm inclined to reject this patch. > Could you help me to figure out, please. The problem was an Assert that was speculative when it went in, and which we eventually found was wrong in the context of logical replication. We removed the Assert. I don't think we need a test case to keep us from putting back the Assert. That line of thinking leads to test suites that run for fourteen hours and are near useless because developers can't run them easily. regards, tom lane
On 05.04.2023 17:35, Tom Lane wrote: > "Anton A. Melnikov" <aamelnikov@inbox.ru> writes: >> On 03.04.2023 21:49, Tom Lane wrote: >>> I did not think this case was worth memorializing in a test before, >>> and I still do not. I'm inclined to reject this patch. > >> Could you help me to figure out, please. > > The problem was an Assert that was speculative when it went in, > and which we eventually found was wrong in the context of logical > replication. We removed the Assert. I don't think we need a test > case to keep us from putting back the Assert. That line of thinking > leads to test suites that run for fourteen hours and are near useless > because developers can't run them easily. > > regards, tom lane Ok, i understand! Thanks a lot for the clarification. A rather important point, i'll take it into account for the future. Let's do that. Revoked the patch. With the best wishes! -- Anton A. Melnikov Postgres Professional: http://www.postgrespro.com The Russian Postgres Company