Thread: Unresolved replication hang and stop problem.
Hello,
After upgrading the cluster from 10.x to 13.1 we've started getting a problem described on pgsql-general:
https://www.postgresql.org/message-id/8bf8785c-f47d-245c-b6af-80dc1eed40db%40unitygroup.com
We've noticed similar issue being described on this list in
https://www.postgresql-archive.org/Logical-replication-CPU-bound-with-TRUNCATE-DROP-CREATE-many-tables-tt6155123.html
with a fix being rolled out in 13.2.
After the 13.2 release, we've upgraded to it and unfortunately this did not solve the issue - the replication still stalls just as described in the original issue. Please advise how to debug and solve this issue.
On Tue, Apr 13, 2021 at 1:18 PM Krzysztof Kois <krzysztof.kois@unitygroup.com> wrote:
>
> Hello,
> After upgrading the cluster from 10.x to 13.1 we've started getting a problem describe pgsql-general:
> https://www.postgresql.org/message-id/8bf8785c-f47d-245c-b6af-80dc1eed40db%40unitygroup.com
> We've noticed similar issue being described on this list in
> https://www.postgresql-archive.org/Logical-replication-CPU-bound-with-TRUNCATE-DROP-CREATE-many-tables-tt6155123.html
> with a fix being rolled out in 13.2.
>

The fix for the problem discussed in the above threads is committed only in PG-14, see [1]. I don't know what makes you think it is fixed in 13.2. Also, it is not easy to back-patch that because this fix depends on some of the infrastructure introduced in PG-14.

[1] - https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=d7eb52d7181d83cf2363570f7a205b8eb1008dbc

--
With Regards,
Amit Kapila.
On 2021-Apr-14, Amit Kapila wrote:

> On Tue, Apr 13, 2021 at 1:18 PM Krzysztof Kois
> <krzysztof.kois@unitygroup.com> wrote:
> >
> > Hello,
> > After upgrading the cluster from 10.x to 13.1 we've started getting a problem describe pgsql-general:
> > https://www.postgresql.org/message-id/8bf8785c-f47d-245c-b6af-80dc1eed40db%40unitygroup.com
> > We've noticed similar issue being described on this list in
> > https://www.postgresql-archive.org/Logical-replication-CPU-bound-with-TRUNCATE-DROP-CREATE-many-tables-tt6155123.html
> > with a fix being rolled out in 13.2.
>
> The fix for the problem discussed in the above threads is committed
> only in PG-14, see [1]. I don't know what makes you think it is fixed
> in 13.2. Also, it is not easy to back-patch that because this fix
> depends on some of the infrastructure introduced in PG-14.
>
> [1] - https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=d7eb52d7181d83cf2363570f7a205b8eb1008dbc

Hmm ... On what does it depend (other than plain git conflicts, which are aplenty)? On a quick look at the commit, it's clear that we need to be careful in order not to cause an ABI break; that doesn't seem impossible to solve, but I'm wondering if there is more to it than that.

--
Álvaro Herrera 39°49'30"S 73°17'W
On Wed, Apr 28, 2021 at 6:48 AM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
>
> On 2021-Apr-14, Amit Kapila wrote:
> [...]
>
> Hmm ... On what does it depend (other than plain git conflicts, which
> are aplenty)? On a quick look to the commit, it's clear that we need to
> be careful in order not to cause an ABI break, but that doesn't seem
> impossible to solve, but I'm wondering if there is more to it than that.
>

As mentioned in the commit message, we need another commit [1] change to make this work.

[1] - https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=c55040ccd0

--
With Regards,
Amit Kapila.
On 2021-Apr-28, Amit Kapila wrote:

> On Wed, Apr 28, 2021 at 6:48 AM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
> > Hmm ... On what does it depend (other than plain git conflicts, which
> > are aplenty)? [...]
>
> As mentioned in the commit message, we need another commit [1] change
> to make this work.
>
> [1] - https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=c55040ccd0

Oh, yeah, that looks tougher. (Still not impossible: it adds a new WAL message type, but we have added such on a minor release before.)

... It's strange that replication worked for them on pg10 though and broke on 13. What did we change to make it so? (I don't have any fish to fry on this topic at present, but it seems a bit concerning.)

--
Álvaro Herrera Valdivia, Chile
On Wed, Apr 28, 2021 at 7:36 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
>
> [...]
>
> Oh, yeah, that looks tougher. (Still not impossible: it adds a new WAL
> message type, but we have added such on a minor release before.)
>

Yeah, we can try to make it possible if it is really a pressing issue, but I guess even in that case it is better to do it after we release PG14 so that it can get some more testing.

> ... It's strange that replication worked for them on pg10 though and
> broke on 13. What did we change anything to make it so?
>

No idea, but probably if the other person can share the exact test case which he sees working fine on PG10 but not on PG13 then it might be a bit easier to investigate.

--
With Regards,
Amit Kapila.
Cc'ing Lukasz Biegaj because of the pgsql-general thread.

On 2021-Apr-29, Amit Kapila wrote:

> On Wed, Apr 28, 2021 at 7:36 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
> > ... It's strange that replication worked for them on pg10 though and
> > broke on 13. What did we change anything to make it so?
>
> No idea but probably if the other person can share the exact test case
> which he sees working fine on PG10 but not on PG13 then it might be a
> bit easier to investigate.

Ah, I noticed now that Krzysztof posted links to these older threads, where a problem is described:

https://www.postgresql.org/message-id/flat/CANDwggKYveEtXjXjqHA6RL3AKSHMsQyfRY6bK%2BNqhAWJyw8psQ%40mail.gmail.com
https://www.postgresql.org/message-id/flat/8bf8785c-f47d-245c-b6af-80dc1eed40db%40unitygroup.com

Krzysztof said "after upgrading to pg13 we started having problems", which implicitly indicates that the same thing worked well in pg10 --- but if the problem has been correctly identified, then this wouldn't have worked in pg10 either. So something in the story doesn't quite match up. Maybe it's not the same problem after all, or maybe they weren't doing X in pg10 which they are attempting in pg13.

Krzysztof, Lukasz, maybe you can describe more?

--
Álvaro Herrera Valdivia, Chile
Hey, thanks for reaching out and sorry for the late reply - we had a few days of national holidays.

On 29.04.2021 15:55, Alvaro Herrera wrote:
> https://www.postgresql.org/message-id/flat/CANDwggKYveEtXjXjqHA6RL3AKSHMsQyfRY6bK%2BNqhAWJyw8psQ%40mail.gmail.com
> https://www.postgresql.org/message-id/flat/8bf8785c-f47d-245c-b6af-80dc1eed40db%40unitygroup.com
>
> Krzysztof said "after upgrading to pg13 we started having problems",
> which implicitly indicates that the same thing worked well in pg10 ---
> but if the problem has been correctly identified, then this wouldn't
> have worked in pg10 either. So something in the story doesn't quite
> match up. Maybe it's not the same problem after all, or maybe they
> weren't doing X in pg10 which they are attempting in pg13.

The problem started occurring after the upgrade from pg10 to pg13. No other changes were performed, in particular none to the database structure or to the operations performed. The problem is as described in https://www.postgresql.org/message-id/flat/8bf8785c-f47d-245c-b6af-80dc1eed40db%40unitygroup.com

It occurs on two separate production clusters and one test cluster - all belonging to the same customer, although processing slightly different data (it's an e-commerce store with multiple languages and separate production databases for each language).

We've tried recreating the database from a dump and recreating the replication, but without any positive effect - the problem persists.

We did not roll the databases back to pg10; instead we've stayed with pg13 and implemented a shell script that kills the walsender process if it seems stuck in `hash_seq_search`. It's ugly, but it works, and we back up and monitor the data integrity anyway.

I'd be happy to help in debugging the issue if I knew how to do it :-). If you'd like, we can also try to roll the installation back to pg10 to gain certainty that this was not caused by schema changes.
-- Lukasz Biegaj | Unity Group | https://www.unitygroup.com/ System Architect, AWS Certified Solutions Architect
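As an aside, the stuck-walsender watchdog described above can be sketched roughly as follows. This is a hypothetical Python rendition, not the actual shell script; the command-line matching, the idea of sampling stack symbols (e.g. from /proc/<pid>/stack or a short perf run), and the 90% threshold are all illustrative assumptions:

```python
# Hypothetical sketch of the watchdog decision logic: flag a walsender
# whose sampled stacks are dominated by hash_seq_search. The real
# workaround is a shell script; names and thresholds here are invented.

def looks_stuck(cmdline, stack_samples, threshold=0.9):
    """Return True if this process is a walsender and the given stack
    samples are dominated by hash_seq_search."""
    if "walsender" not in cmdline:
        return False
    if not stack_samples:
        return False
    hits = sum(1 for s in stack_samples if "hash_seq_search" in s)
    return hits / len(stack_samples) >= threshold

# Example: ten consecutive samples, all inside hash_seq_search
samples = ["hash_seq_search <- ReorderBufferLargestTXN"] * 10
print(looks_stuck("postgres: walsender repl streaming", samples))  # True
```

A script like this would pair the decision with a SIGTERM to the walsender; killing a walsender is generally safe, since the subscriber reconnects and resumes from the replication slot.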
Hi Lukasz, thanks for following up.

On 2021-May-04, Lukasz Biegaj wrote:

> The problem is as described in https://www.postgresql.org/message-id/flat/8bf8785c-f47d-245c-b6af-80dc1eed40db%40unitygroup.com
>
> It does occur on two separate production clusters and one test cluster - all
> belonging to the same customer, although processing slightly different data
> (it's an e-commerce store with multiple languages and separate production
> databases for each language).

I think the best next move would be to make certain that the problem is what we think it is, so that we can discuss whether Amit's commit is an appropriate fix.

I would suggest doing that by running the problematic workload in the test system under "perf record -g" and then getting a report with "perf report -g", which should hopefully give enough of a clue. (Sometimes the reports are much better if you use a binary that was compiled with -fno-omit-frame-pointer, so if you're in a position to try that, it might be useful -- or apparently you could try "perf record --call-graph dwarf" or "perf record --call-graph lbr", depending.)

Also, I would be much more comfortable about proposing to backpatch such an invasive change if you could ensure that in pg10 the same workload does not cause the problem. If it does, then it'd be clear we're talking about a regression.

--
Álvaro Herrera Valdivia, Chile
"I'm always right, but sometimes I'm more right than other times." (Linus Torvalds)
On 04.05.2021 16:35, Alvaro Herrera wrote:
> I would suggest to do that by running the problematic
> workload in the test system under "perf record -g"
> [..]
> you could ensure that in pg10 the same workload
> does not cause the problem.

We'll go with both suggestions. I expect to come back to you with results in about a week or two.

--
Lukasz Biegaj | Unity Group | https://www.unitygroup.com/
System Architect, AWS Certified Solutions Architect
On Wed, Jun 16, 2021 at 12:31 PM Lukasz Biegaj <lukasz.biegaj@unitygroup.com> wrote:
> On 04.05.2021 16:35, Alvaro Herrera wrote:
> > I would suggest to do that by running the problematic
> > workload in the test system under "perf record -g"
> We'll go with both propositions. I expect to come back to you with
> results in about a week or two.

Hi Alvaro,

We reproduced the replication issue and recorded the walsender process using perf. Below you can find the data for the broken and the working replication:

https://easyupload.io/kxcovg
password to the zip file: johS5jeewo

Please let me know if you would like us to proceed with the downgrade.

--
Hubert Klasa
On 2021-Jun-16, Ha Ka wrote:

> On Wed, Jun 16, 2021 at 12:31 PM Lukasz Biegaj <lukasz.biegaj@unitygroup.com> wrote:
> > On 04.05.2021 16:35, Alvaro Herrera wrote:
> > > I would suggest to do that by running the problematic
> > > workload in the test system under "perf record -g"
> > We'll go with both propositions. I expect to come back to you with
> > results in about a week or two.
>
> Hi Alvaro,
> We reproduced the replication issue and recorded the walsender process
> using perf.

Hello, thanks, I downloaded the files, but since you sent the perf.data files there's not much I can do to usefully interpret them. Can you please run "perf report -g > perf_report.txt" in each subdir with a perf.data file and upload those text files? (You don't need to rerun the test cases.)

Thanks

--
Álvaro Herrera Valdivia, Chile
"People get annoyed when you try to debug them." (Larry Wall)
> Hello, thanks, I downloaded the files but since you sent the perf.data
> files there's not much I can do to usefully interpret them. Can you
> please do "perf report -g > perf_report.txt" on each subdir with a
> perf.data file and upload those text files? (You don't need to rerun
> the test cases.)
> Thanks

Hi,

Here is the upload with the generated reports: https://easyupload.io/p38izx
passwd: johS5jeewo

Regards
--
Hubert Klasa
On 2021-Jun-16, Ha Ka wrote:

> Here is the upload with generated reports: https://easyupload.io/p38izx
> passwd: johS5jeewo

OK, so I downloaded that, and this is the interesting entry in the profile for the broken case:

# Samples: 5K of event 'cpu-clock'
# Event count (approx.): 59989898390
#
# Children      Self  Command   Shared Object  Symbol
# ........  ........  ........  .............  ..................................
#
   100.00%     0.00%  postgres  postgres       [.] exec_replication_command
            |
            ---exec_replication_command
               WalSndLoop
               XLogSendLogical
               LogicalDecodingProcessRecord
               |
                --99.51%--ReorderBufferQueueChange
                          |
                          |--96.06%--hash_seq_search
                          |
                          |--1.78%--ReorderBufferSerializeTXN
                          |          |
                          |           --0.52%--errstart
                          |
                           --0.76%--deregister_seq_scan

What this tells me is that ReorderBufferQueueChange is spending a lot of time doing hash_seq_search, which probably is the one in ReorderBufferTXNByXid. I have, as yet, no idea what this means.

--
Álvaro Herrera 39°49'30"S 73°17'W
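(Side note: when there are many such text reports to sift through, the dominant branches can be pulled out mechanically. A throwaway sketch, assuming perf's default call-graph text layout as quoted above; not a tool anyone in the thread actually used:)

```python
import re

# Rough helper for scanning "perf report -g" text dumps: extract the
# "--NN.NN%--symbol" branch lines and sort by percentage, highest first.
# Assumes perf's default call-graph text layout; purely illustrative.

def branch_percentages(report_text):
    pat = re.compile(r"--(\d+\.\d+)%--(\w+)")
    return sorted(((float(p), sym) for p, sym in pat.findall(report_text)),
                  reverse=True)

sample = """
 --99.51%--ReorderBufferQueueChange
           |--96.06%--hash_seq_search
           |--1.78%--ReorderBufferSerializeTXN
"""
print(branch_percentages(sample)[0])  # (99.51, 'ReorderBufferQueueChange')
```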
At Wed, 16 Jun 2021 18:28:28 -0400, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote in

> On 2021-Jun-16, Ha Ka wrote:
> [perf profile snipped; see above]
>
> What this tells me is that ReorderBufferQueueChange is spending a lot of
> time doing hash_seq_search, which probably is the one in
> ReorderBufferTXNByXid.

I don't see a call to hash_*seq*_search there. Instead, I see one in ReorderBufferCheckMemoryLimit().

I added an elog line in hash_seq_search that fires only when it is called under ReorderBufferQueueChange, then set logical_decoding_work_mem to 64kB. Running the following calls hash_seq_search (relatively) frequently.

pub=# create table t1 (a int primary key);
pub=# create publication p1 for table t1;
sub=# create table t1 (a int primary key);
sub=# create subscription s1 connection 'host=/tmp port=5432' publication p1;
pub=# insert into t1 (select a from generate_series(0, 9999) a);

The insert above makes 20 calls to ReorderBufferLargestTXN() (via ReorderBufferCheckMemoryLimit()), which loops over hash_seq_search.

/*
 * Find the largest transaction (toplevel or subxact) to evict (spill to disk).
 *
 * XXX With many subtransactions this might be quite slow, because we'll have
 * to walk through all of them. There are some options how we could improve
 * that: (a) maintain some secondary structure with transactions sorted by
 * amount of changes, (b) not looking for the entirely largest transaction,
 * but e.g. for transaction using at least some fraction of the memory limit,
 * and (c) evicting multiple transactions at once, e.g. to free a given portion
 * of the memory limit (e.g. 50%).
 */
static ReorderBufferTXN *
ReorderBufferLargestTXN(ReorderBuffer *rb)

This looks like a candidate for the culprit. The perf line for ReorderBufferSerializeTXN supports this hypothesis.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center
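The XXX comment above can be made concrete with a rough cost model (plain Python, purely illustrative, not the actual C code): if the memory limit is exceeded E times while N (sub)transactions are in progress, each eviction pays a full O(N) walk over the hash table, so roughly E*N entries get visited in total:

```python
# Rough model of ReorderBufferLargestTXN's cost, per the XXX comment
# above: every time the memory limit is exceeded, the code walks ALL
# in-progress transactions to find the largest one. Illustration only.

def evict_largest(txn_sizes):
    """One eviction: scan every txn (the hash_seq_search walk) and
    return (index of the largest, number of entries visited)."""
    largest = max(range(len(txn_sizes)), key=lambda i: txn_sizes[i])
    return largest, len(txn_sizes)  # cost = one visit per txn

def total_scan_cost(n_txns, n_evictions):
    sizes = [1] * n_txns
    cost = 0
    for _ in range(n_evictions):
        idx, c = evict_largest(sizes)
        cost += c
        sizes[idx] = 0  # spilled to disk, but still tracked
    return cost

# 5,000 in-progress (sub)transactions and 20 evictions
# -> 100,000 hash entries visited in total
print(total_scan_cost(5000, 20))  # 100000
```

This is why options (a)-(c) in the comment all aim at avoiding the full walk per eviction.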
On Thu, Jun 17, 2021 at 7:28 AM Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote:
>
> [...]
>
> I don't see a call to hash_*seq*_search there. Instead, I see one in
> ReorderBufferCheckMemoryLimit().
> [...]
> The insert above makes 20 calls to ReorderBufferLargestTXN() (via
> ReorderBufferCheckmemoryLimit()), which loops over hash_seq_search.
>

If there are large transactions then someone can probably set logical_decoding_work_mem to a higher value.

--
With Regards,
Amit Kapila.
On 2021-Jun-17, Kyotaro Horiguchi wrote:

> I don't see a call to hash_*seq*_search there. Instead, I see one in
> ReorderBufferCheckMemoryLimit().

Doh, of course -- I misread.

ReorderBufferCheckMemoryLimit is new in pg13 (cec2edfa7859), so now at least we have a reason why this workload regresses in pg13 compared to earlier releases.

Looking at the code, it does seem that increasing the memory limit as Amit suggests might solve the issue. Is that a practical workaround?

--
Álvaro Herrera Valdivia, Chile
At Thu, 17 Jun 2021 12:56:42 -0400, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote in

> [...]
> Looking at the code, it does seem that increasing the memory limit as
> Amit suggests might solve the issue. Is that a practical workaround?

I believe so, generally. I'm not sure about the OP, though.

Just increasing the working memory to a certain size would solve the problem. A potential issue is that, as in this case, it might be hard for users to find out that this GUC addresses their problem (if it actually does). pg_stat_replication_slots.spill_count / spill_txns could be a guide for setting logical_decoding_work_mem. Is it worth having additional explanation like that for the GUC variable in the documentation?

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center
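To illustrate the kind of guidance that could go into the documentation, here is a hypothetical tuning heuristic based on those spill columns (the column names follow PG-14's pg_stat_replication_slots; the thresholds and multipliers are invented for illustration, not from any PostgreSQL source):

```python
# Hypothetical heuristic: if transactions keep spilling to disk, the
# decoding working memory is probably undersized. spill_txns/spill_count
# come from pg_stat_replication_slots (PG-14+); thresholds are made up.

def suggest_work_mem_kb(spill_txns, spill_count, current_mem_kb):
    """Return a suggested logical_decoding_work_mem (in kB)."""
    if spill_txns == 0:
        return current_mem_kb  # nothing is spilling; leave it alone
    if spill_count / spill_txns > 10:
        # many spills per transaction: raise aggressively
        return current_mem_kb * 4
    return current_mem_kb * 2

print(suggest_work_mem_kb(spill_txns=0, spill_count=0, current_mem_kb=65536))    # 65536
print(suggest_work_mem_kb(spill_txns=5, spill_count=200, current_mem_kb=65536))  # 262144
```

Note that, as the rest of the thread shows, spilling is not the only way this code path can hurt: the per-eviction hash walk itself can dominate, in which case no amount of extra memory fixes it.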
On Fri, Jun 18, 2021 at 11:22 AM Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote:
>
> [...]
> Just increasing the working memory to certain size would solve the
> problem. There might be a potential issue that it might be hard like
> this case for users to find out that the GUC works for their issue (if
> actually works). pg_stat_replicatoin_slots.spill_count / spill_txns
> could be a guide for setting logical_decoding_work_mem. Is it worth
> having additional explanation like that for the GUC variable in the
> documentation?
>

I don't see any harm in doing that, but note that we can update it only for PG-14. The view pg_stat_replication_slots is introduced in PG-14.

--
With Regards,
Amit Kapila.
On Sat, Jun 19, 2021 at 12:14 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> [...]
>
> I don't see any harm in doing that but note that we can update it only
> for PG-14. The view pg_stat_replicatoin_slots is introduced in PG-14.

We increased logical_decoding_work_mem for our production database from 64 to 192 MB, and it looks like the issue still persists. The frequency with which replication hangs has remained the same. Do you need any additional perf reports after our change?

--
Regards,
Hubert Klasa.
On Tue, Aug 10, 2021 at 8:15 PM Ha Ka <klasahubert@gmail.com> wrote:
>
> We increased logical_decoding_work_mem for our production database
> from 64 to 192 MB and it looks like the issue still persists. The
> frequency with which replication hangs has remained the same.
>

Sounds strange. I think one thing to identify is whether there are a very large number of in-progress transactions at the time the slowdown happens, because the profile shared last time seems to spend more time in hash_seq_search than in actually serializing the xact. Another possibility to try out for your case is to just always serialize the current xact and see what happens; this might not be an actual solution but can help in diagnosing the problem.

> Do you
> need any additional perf reports after our change?
>

It might be good if you can share the walsender portion of perf, as shared in one of the emails above.

--
With Regards,
Amit Kapila.