Thread: Unresolved replication hang and stop problem.
Hello,
After upgrading the cluster from 10.x to 13.1 we've started getting a problem described on pgsql-general:
https://www.postgresql.org/message-id/8bf8785c-f47d-245c-b6af-80dc1eed40db%40unitygroup.com
We've noticed similar issue being described on this list in
https://www.postgresql-archive.org/Logical-replication-CPU-bound-with-TRUNCATE-DROP-CREATE-many-tables-tt6155123.html
with a fix being rolled out in 13.2.
After the 13.2 release, we've upgraded to it and unfortunately this did not solve the issue - the replication still stalls just as described in the original issue. Please advise how to debug and solve this issue.
On Tue, Apr 13, 2021 at 1:18 PM Krzysztof Kois <krzysztof.kois@unitygroup.com> wrote:
>
> Hello,
> After upgrading the cluster from 10.x to 13.1 we've started getting a problem describe pgsql-general:
> https://www.postgresql.org/message-id/8bf8785c-f47d-245c-b6af-80dc1eed40db%40unitygroup.com
> We've noticed similar issue being described on this list in
> https://www.postgresql-archive.org/Logical-replication-CPU-bound-with-TRUNCATE-DROP-CREATE-many-tables-tt6155123.html
> with a fix being rolled out in 13.2.
>

The fix for the problem discussed in the above threads is committed only in PG-14, see [1]. I don't know what makes you think it is fixed in 13.2. Also, it is not easy to back-patch that because this fix depends on some of the infrastructure introduced in PG-14.

[1] - https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=d7eb52d7181d83cf2363570f7a205b8eb1008dbc

--
With Regards,
Amit Kapila.
On 2021-Apr-14, Amit Kapila wrote:

> On Tue, Apr 13, 2021 at 1:18 PM Krzysztof Kois
> <krzysztof.kois@unitygroup.com> wrote:
> >
> > Hello,
> > After upgrading the cluster from 10.x to 13.1 we've started getting a problem describe pgsql-general:
> > https://www.postgresql.org/message-id/8bf8785c-f47d-245c-b6af-80dc1eed40db%40unitygroup.com
> > We've noticed similar issue being described on this list in
> > https://www.postgresql-archive.org/Logical-replication-CPU-bound-with-TRUNCATE-DROP-CREATE-many-tables-tt6155123.html
> > with a fix being rolled out in 13.2.
>
> The fix for the problem discussed in the above threads is committed
> only in PG-14, see [1]. I don't know what makes you think it is fixed
> in 13.2. Also, it is not easy to back-patch that because this fix
> depends on some of the infrastructure introduced in PG-14.
>
> [1] - https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=d7eb52d7181d83cf2363570f7a205b8eb1008dbc

Hmm ... On what does it depend (other than plain git conflicts, which are aplenty)? On a quick look at the commit, it's clear that we need to be careful in order not to cause an ABI break; that doesn't seem impossible to solve, but I'm wondering if there is more to it than that.

--
Álvaro Herrera 39°49'30"S 73°17'W
On Wed, Apr 28, 2021 at 6:48 AM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
>
> On 2021-Apr-14, Amit Kapila wrote:
> [...]
>
> Hmm ... On what does it depend (other than plain git conflicts, which
> are aplenty)? On a quick look to the commit, it's clear that we need to
> be careful in order not to cause an ABI break, but that doesn't seem
> impossible to solve, but I'm wondering if there is more to it than that.
>

As mentioned in the commit message, we need another commit [1] change to make this work.

[1] - https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=c55040ccd0

--
With Regards,
Amit Kapila.
On 2021-Apr-28, Amit Kapila wrote:

> On Wed, Apr 28, 2021 at 6:48 AM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
> > Hmm ... On what does it depend (other than plain git conflicts, which
> > are aplenty)? [...]
>
> As mentioned in the commit message, we need another commit [1] change
> to make this work.
>
> [1] - https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=c55040ccd0

Oh, yeah, that looks tougher. (Still not impossible: it adds a new WAL message type, but we have added such on a minor release before.)

... It's strange that replication worked for them on pg10 though and broke on 13. What did we change to make it so? (I don't have any fish to fry on this topic at present, but it seems a bit concerning.)

--
Álvaro Herrera Valdivia, Chile
On Wed, Apr 28, 2021 at 7:36 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
>
> [...]
>
> Oh, yeah, that looks tougher. (Still not impossible: it adds a new WAL
> message type, but we have added such on a minor release before.)
>

Yeah, we can try to make it possible if it is really a pressing issue, but I guess even in that case it is better to do it after we release PG14 so that it can get some more testing.

> ... It's strange that replication worked for them on pg10 though and
> broke on 13. What did we change anything to make it so?
>

No idea, but probably if the other person can share the exact test case which he sees working fine on PG10 but not on PG13 then it might be a bit easier to investigate.

--
With Regards,
Amit Kapila.
Cc'ing Lukasz Biegaj because of the pgsql-general thread.

On 2021-Apr-29, Amit Kapila wrote:

> On Wed, Apr 28, 2021 at 7:36 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
> > ... It's strange that replication worked for them on pg10 though and
> > broke on 13. What did we change anything to make it so?
>
> No idea but probably if the other person can share the exact test case
> which he sees working fine on PG10 but not on PG13 then it might be a
> bit easier to investigate.

Ah, I noticed now that Krzysztof posted links to these older threads, where a problem is described:

https://www.postgresql.org/message-id/flat/CANDwggKYveEtXjXjqHA6RL3AKSHMsQyfRY6bK%2BNqhAWJyw8psQ%40mail.gmail.com
https://www.postgresql.org/message-id/flat/8bf8785c-f47d-245c-b6af-80dc1eed40db%40unitygroup.com

Krzysztof said "after upgrading to pg13 we started having problems", which implicitly indicates that the same thing worked well in pg10 --- but if the problem has been correctly identified, then this wouldn't have worked in pg10 either. So something in the story doesn't quite match up. Maybe it's not the same problem after all, or maybe they weren't doing X in pg10 which they are attempting in pg13.

Krzysztof, Lukasz, maybe you can describe more?

--
Álvaro Herrera Valdivia, Chile
Hey, thanks for reaching out and sorry for the late reply - we had a few days of national holidays.

On 29.04.2021 15:55, Alvaro Herrera wrote:
> https://www.postgresql.org/message-id/flat/CANDwggKYveEtXjXjqHA6RL3AKSHMsQyfRY6bK%2BNqhAWJyw8psQ%40mail.gmail.com
> https://www.postgresql.org/message-id/flat/8bf8785c-f47d-245c-b6af-80dc1eed40db%40unitygroup.com
>
> Krzysztof said "after upgrading to pg13 we started having problems",
> which implicitly indicates that the same thing worked well in pg10 ---
> but if the problem has been correctly identified, then this wouldn't
> have worked in pg10 either. So something in the story doesn't quite
> match up. Maybe it's not the same problem after all, or maybe they
> weren't doing X in pg10 which they are attempting in pg13.

The problem started occurring after the upgrade from pg10 to pg13. No other changes were performed, in particular none to the database structure or to the operations performed. The problem is as described in https://www.postgresql.org/message-id/flat/8bf8785c-f47d-245c-b6af-80dc1eed40db%40unitygroup.com

It occurs on two separate production clusters and one test cluster - all belonging to the same customer, although processing slightly different data (it's an e-commerce store with multiple languages and separate production databases for each language).

We've tried recreating the database from a dump and recreating the replication, but without any positive effect - the problem persists.

We did not roll the databases back to pg10; instead we've stayed with pg13 and implemented a shell script that kills the walsender process if it seems stuck in `hash_seq_search`. It's ugly, but it works, and we back up and monitor the data integrity anyway.

I'd be happy to help in debugging the issue if I knew how to do it :-). If you'd like, we can also try to roll the installation back to pg10 to gain certainty that this was not caused by schema changes.
-- Lukasz Biegaj | Unity Group | https://www.unitygroup.com/ System Architect, AWS Certified Solutions Architect
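As an aside, the stuck-walsender watchdog described above can be sketched roughly as follows. This is a hypothetical Python rendition, not the actual shell script; the command-line matching, the idea of sampling stack symbols (e.g. from /proc/<pid>/stack or a short perf run), and the 90% threshold are all illustrative assumptions:

```python
# Hypothetical sketch of the watchdog decision logic: flag a walsender
# whose sampled stacks are dominated by hash_seq_search. The real
# workaround is a shell script; names and thresholds here are invented.

def looks_stuck(cmdline, stack_samples, threshold=0.9):
    """Return True if this process is a walsender and the given stack
    samples are dominated by hash_seq_search."""
    if "walsender" not in cmdline:
        return False
    if not stack_samples:
        return False
    hits = sum(1 for s in stack_samples if "hash_seq_search" in s)
    return hits / len(stack_samples) >= threshold

# Example: ten consecutive samples, all inside hash_seq_search
samples = ["hash_seq_search <- ReorderBufferLargestTXN"] * 10
print(looks_stuck("postgres: walsender repl streaming", samples))  # True
```

A script like this would pair the decision with a SIGTERM to the walsender; killing a walsender is generally safe, since the subscriber reconnects and resumes from the replication slot.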
Hi Lukasz, thanks for following up.

On 2021-May-04, Lukasz Biegaj wrote:

> The problem is as described in https://www.postgresql.org/message-id/flat/8bf8785c-f47d-245c-b6af-80dc1eed40db%40unitygroup.com
>
> It does occur on two separate production clusters and one test cluster - all
> belonging to the same customer, although processing slightly different data
> (it's an e-commerce store with multiple languages and separate production
> databases for each language).

I think the best next move would be to make certain that the problem is what we think it is, so that we can discuss whether Amit's commit is an appropriate fix.

I would suggest doing that by running the problematic workload in the test system under "perf record -g" and then getting a report with "perf report -g", which should hopefully give enough of a clue. (Sometimes the reports are much better if you use a binary that was compiled with -fno-omit-frame-pointer, so if you're in a position to try that, it might be useful -- or apparently you could try "perf record --call-graph dwarf" or "perf record --call-graph lbr", depending.)

Also, I would be much more comfortable about proposing to backpatch such an invasive change if you could ensure that in pg10 the same workload does not cause the problem. If it does, then it'd be clear we're talking about a regression.

--
Álvaro Herrera Valdivia, Chile
"I'm always right, but sometimes I'm more right than other times." (Linus Torvalds)
On 04.05.2021 16:35, Alvaro Herrera wrote:
> I would suggest to do that by running the problematic
> workload in the test system under "perf record -g"
> [..]
> you could ensure that in pg10 the same workload
> does not cause the problem.

We'll go with both suggestions. I expect to come back to you with results in about a week or two.

--
Lukasz Biegaj | Unity Group | https://www.unitygroup.com/
System Architect, AWS Certified Solutions Architect
On Wed, Jun 16, 2021 at 12:31 PM Lukasz Biegaj <lukasz.biegaj@unitygroup.com> wrote:
> On 04.05.2021 16:35, Alvaro Herrera wrote:
> > I would suggest to do that by running the problematic
> > workload in the test system under "perf record -g"
> We'll go with both propositions. I expect to come back to you with
> results in about a week or two.

Hi Alvaro,

We reproduced the replication issue and recorded the walsender process using perf. Below you can find the data for the broken and the working replication:

https://easyupload.io/kxcovg
password to the zip file: johS5jeewo

Please let me know if you would like us to proceed with the downgrade.

--
Hubert Klasa
On 2021-Jun-16, Ha Ka wrote:

> On Wed, Jun 16, 2021 at 12:31 PM Lukasz Biegaj <lukasz.biegaj@unitygroup.com> wrote:
> > On 04.05.2021 16:35, Alvaro Herrera wrote:
> > > I would suggest to do that by running the problematic
> > > workload in the test system under "perf record -g"
> > We'll go with both propositions. I expect to come back to you with
> > results in about a week or two.
>
> Hi Alvaro,
> We reproduced the replication issue and recorded the walsender process
> using perf.

Hello, thanks, I downloaded the files, but since you sent the perf.data files there's not much I can do to usefully interpret them. Can you please run "perf report -g > perf_report.txt" in each subdir with a perf.data file and upload those text files? (You don't need to rerun the test cases.)

Thanks

--
Álvaro Herrera Valdivia, Chile
"People get annoyed when you try to debug them." (Larry Wall)
> Hello, thanks, I downloaded the files but since you sent the perf.data
> files there's not much I can do to usefully interpret them. Can you
> please do "perf report -g > perf_report.txt" on each subdir with a
> perf.data file and upload those text files? (You don't need to rerun
> the test cases.)
> Thanks

Hi,

Here is the upload with the generated reports: https://easyupload.io/p38izx
passwd: johS5jeewo

Regards
--
Hubert Klasa
On 2021-Jun-16, Ha Ka wrote:

> Here is the upload with generated reports: https://easyupload.io/p38izx
> passwd: johS5jeewo

OK, so I downloaded that, and this is the interesting entry in the profile for the broken case:

# Samples: 5K of event 'cpu-clock'
# Event count (approx.): 59989898390
#
# Children      Self  Command   Shared Object  Symbol
# ........  ........  ........  .............  ..................................
#
   100.00%     0.00%  postgres  postgres       [.] exec_replication_command
            |
            ---exec_replication_command
               WalSndLoop
               XLogSendLogical
               LogicalDecodingProcessRecord
               |
                --99.51%--ReorderBufferQueueChange
                          |
                          |--96.06%--hash_seq_search
                          |
                          |--1.78%--ReorderBufferSerializeTXN
                          |          |
                          |           --0.52%--errstart
                          |
                           --0.76%--deregister_seq_scan

What this tells me is that ReorderBufferQueueChange is spending a lot of time doing hash_seq_search, which probably is the one in ReorderBufferTXNByXid. I have, as yet, no idea what this means.

--
Álvaro Herrera 39°49'30"S 73°17'W
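(Side note: when there are many such text reports to sift through, the dominant branches can be pulled out mechanically. A throwaway sketch, assuming perf's default call-graph text layout as quoted above; not a tool anyone in the thread actually used:)

```python
import re

# Rough helper for scanning "perf report -g" text dumps: extract the
# "--NN.NN%--symbol" branch lines and sort by percentage, highest first.
# Assumes perf's default call-graph text layout; purely illustrative.

def branch_percentages(report_text):
    pat = re.compile(r"--(\d+\.\d+)%--(\w+)")
    return sorted(((float(p), sym) for p, sym in pat.findall(report_text)),
                  reverse=True)

sample = """
 --99.51%--ReorderBufferQueueChange
           |--96.06%--hash_seq_search
           |--1.78%--ReorderBufferSerializeTXN
"""
print(branch_percentages(sample)[0])  # (99.51, 'ReorderBufferQueueChange')
```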
At Wed, 16 Jun 2021 18:28:28 -0400, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote in

> On 2021-Jun-16, Ha Ka wrote:
> [perf profile snipped; see above]
>
> What this tells me is that ReorderBufferQueueChange is spending a lot of
> time doing hash_seq_search, which probably is the one in
> ReorderBufferTXNByXid.

I don't see a call to hash_*seq*_search there. Instead, I see one in ReorderBufferCheckMemoryLimit().

I added an elog line in hash_seq_search that fires only when it is called under ReorderBufferQueueChange, then set logical_decoding_work_mem to 64kB. Running the following calls hash_seq_search (relatively) frequently.

pub=# create table t1 (a int primary key);
pub=# create publication p1 for table t1;
sub=# create table t1 (a int primary key);
sub=# create subscription s1 connection 'host=/tmp port=5432' publication p1;
pub=# insert into t1 (select a from generate_series(0, 9999) a);

The insert above makes 20 calls to ReorderBufferLargestTXN() (via ReorderBufferCheckMemoryLimit()), which loops over hash_seq_search.

/*
 * Find the largest transaction (toplevel or subxact) to evict (spill to disk).
 *
 * XXX With many subtransactions this might be quite slow, because we'll have
 * to walk through all of them. There are some options how we could improve
 * that: (a) maintain some secondary structure with transactions sorted by
 * amount of changes, (b) not looking for the entirely largest transaction,
 * but e.g. for transaction using at least some fraction of the memory limit,
 * and (c) evicting multiple transactions at once, e.g. to free a given portion
 * of the memory limit (e.g. 50%).
 */
static ReorderBufferTXN *
ReorderBufferLargestTXN(ReorderBuffer *rb)

This looks like a candidate for the culprit. The perf line for ReorderBufferSerializeTXN supports this hypothesis.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center
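The XXX comment above can be made concrete with a rough cost model (plain Python, purely illustrative, not the actual C code): if the memory limit is exceeded E times while N (sub)transactions are in progress, each eviction pays a full O(N) walk over the hash table, so roughly E*N entries get visited in total:

```python
# Rough model of ReorderBufferLargestTXN's cost, per the XXX comment
# above: every time the memory limit is exceeded, the code walks ALL
# in-progress transactions to find the largest one. Illustration only.

def evict_largest(txn_sizes):
    """One eviction: scan every txn (the hash_seq_search walk) and
    return (index of the largest, number of entries visited)."""
    largest = max(range(len(txn_sizes)), key=lambda i: txn_sizes[i])
    return largest, len(txn_sizes)  # cost = one visit per txn

def total_scan_cost(n_txns, n_evictions):
    sizes = [1] * n_txns
    cost = 0
    for _ in range(n_evictions):
        idx, c = evict_largest(sizes)
        cost += c
        sizes[idx] = 0  # spilled to disk, but still tracked
    return cost

# 5,000 in-progress (sub)transactions and 20 evictions
# -> 100,000 hash entries visited in total
print(total_scan_cost(5000, 20))  # 100000
```

This is why options (a)-(c) in the comment all aim at avoiding the full walk per eviction.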
On Thu, Jun 17, 2021 at 7:28 AM Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote:
>
> [...]
>
> I don't see a call to hash_*seq*_search there. Instead, I see one in
> ReorderBufferCheckMemoryLimit().
> [...]
> The insert above makes 20 calls to ReorderBufferLargestTXN() (via
> ReorderBufferCheckmemoryLimit()), which loops over hash_seq_search.
>

If there are large transactions then someone can probably set logical_decoding_work_mem to a higher value.

--
With Regards,
Amit Kapila.
On 2021-Jun-17, Kyotaro Horiguchi wrote:

> I don't see a call to hash_*seq*_search there. Instead, I see one in
> ReorderBufferCheckMemoryLimit().

Doh, of course -- I misread.

ReorderBufferCheckMemoryLimit is new in pg13 (cec2edfa7859), so now at least we have a reason why this workload regresses in pg13 compared to earlier releases.

Looking at the code, it does seem that increasing the memory limit as Amit suggests might solve the issue. Is that a practical workaround?

--
Álvaro Herrera Valdivia, Chile
At Thu, 17 Jun 2021 12:56:42 -0400, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote in

> [...]
> Looking at the code, it does seem that increasing the memory limit as
> Amit suggests might solve the issue. Is that a practical workaround?

I believe so, generally. I'm not sure about the OP, though.

Just increasing the working memory to a certain size would solve the problem. A potential issue is that, as in this case, it might be hard for users to find out that this GUC addresses their problem (if it actually does). pg_stat_replication_slots.spill_count / spill_txns could be a guide for setting logical_decoding_work_mem. Is it worth having additional explanation like that for the GUC variable in the documentation?

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center
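To illustrate the kind of guidance that could go into the documentation, here is a hypothetical tuning heuristic based on those spill columns (the column names follow PG-14's pg_stat_replication_slots; the thresholds and multipliers are invented for illustration, not from any PostgreSQL source):

```python
# Hypothetical heuristic: if transactions keep spilling to disk, the
# decoding working memory is probably undersized. spill_txns/spill_count
# come from pg_stat_replication_slots (PG-14+); thresholds are made up.

def suggest_work_mem_kb(spill_txns, spill_count, current_mem_kb):
    """Return a suggested logical_decoding_work_mem (in kB)."""
    if spill_txns == 0:
        return current_mem_kb  # nothing is spilling; leave it alone
    if spill_count / spill_txns > 10:
        # many spills per transaction: raise aggressively
        return current_mem_kb * 4
    return current_mem_kb * 2

print(suggest_work_mem_kb(spill_txns=0, spill_count=0, current_mem_kb=65536))    # 65536
print(suggest_work_mem_kb(spill_txns=5, spill_count=200, current_mem_kb=65536))  # 262144
```

Note that, as the rest of the thread shows, spilling is not the only way this code path can hurt: the per-eviction hash walk itself can dominate, in which case no amount of extra memory fixes it.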
On Fri, Jun 18, 2021 at 11:22 AM Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote:
>
> [...]
> Just increasing the working memory to certain size would solve the
> problem. There might be a potential issue that it might be hard like
> this case for users to find out that the GUC works for their issue (if
> actually works). pg_stat_replicatoin_slots.spill_count / spill_txns
> could be a guide for setting logical_decoding_work_mem. Is it worth
> having additional explanation like that for the GUC variable in the
> documentation?
>

I don't see any harm in doing that, but note that we can update it only for PG-14. The view pg_stat_replication_slots is introduced in PG-14.

--
With Regards,
Amit Kapila.
On Sat, Jun 19, 2021 at 12:14 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> [...]
>
> I don't see any harm in doing that but note that we can update it only
> for PG-14. The view pg_stat_replicatoin_slots is introduced in PG-14.

We increased logical_decoding_work_mem for our production database from 64 to 192 MB, and it looks like the issue still persists. The frequency with which replication hangs has remained the same. Do you need any additional perf reports after our change?

--
Regards,
Hubert Klasa.
On Tue, Aug 10, 2021 at 8:15 PM Ha Ka <klasahubert@gmail.com> wrote:
>
> We increased logical_decoding_work_mem for our production database
> from 64 to 192 MB and it looks like the issue still persists. The
> frequency with which replication hangs has remained the same.
>

Sounds strange. I think one thing to identify is whether there are a very large number of in-progress transactions at the time the slowdown happens, because the profile shared last time seems to spend more time in hash_seq_search than in actually serializing the xact. Another possibility to try out for your case is to just always serialize the current xact and see what happens; this might not be an actual solution but can help in diagnosing the problem.

> Do you
> need any additional perf reports after our change?
>

It might be good if you can share the walsender portion of perf, as shared in one of the emails above.

--
With Regards,
Amit Kapila.