Re: MultiXact\SLRU buffers configuration - Mailing list pgsql-hackers

From	Andrey Borodin
Subject	Re: MultiXact\SLRU buffers configuration
Date	October 29, 2020 10:08:21
Msg-id	43F3DE92-F236-4EA5-B4D6-39BEF6BD849D@yandex-team.ru
In response to	Re: MultiXact\SLRU buffers configuration (Tomas Vondra <tomas.vondra@2ndquadrant.com>)
Responses	Re: MultiXact\SLRU buffers configuration
List	pgsql-hackers

On 29 Oct 2020, at 04:32, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:

> It's not my intention to be mean or anything like that, but to me this
> means we don't really understand the problem we're trying to solve. Had
> we understood it, we should be able to construct a workload reproducing
> the issue ...
>
> I understand what the individual patches are doing, and maybe those
> changes are desirable in general. But without any benchmarks from a
> plausible workload I find it hard to convince myself that:
>
> (a) it actually will help with the issue you're observing on production
>
> and
> (b) it's actually worth the extra complexity (e.g. the lwlock changes)
>
> I'm willing to invest some of my time into reviewing/testing this, but I
> think we badly need better insight into the issue, so that we can build
> a workload reproducing it. Perhaps collecting some perf profiles and a
> sample of the queries might help, but I assume you already tried that.

Thanks, Tomas! This totally makes sense.

Indeed, collecting queries has not helped yet. We have a load-test environment equivalent to production (but with 10x fewer shards) and a copy of the production workload queries, but the problem does not manifest there.

Why do I think the problem is in MultiXacts?

Here is a chart with the number of wait events on each host.

During the problem, MultiXactOffsetControlLock and SLRURead dominate all other lock types. After the primary switched over to another node, SLRURead continued there for a bit, then disappeared.

Backtraces on standbys during the problem show that most backends are sleeping in pg_usleep(1000L) and are therefore not included in the wait stats on these charts.

Currently I'm considering writing a test that directly calls MultiXactIdExpand(), MultiXactIdCreate(), and GetMultiXactIdMembers() from an extension. What do you think: would benchmarks from such tests be meaningful?

Thanks!

Best regards, Andrey Borodin.

Attachment

График-1.png
