Re: MultiXact\SLRU buffers configuration - Mailing list pgsql-hackers

From Tomas Vondra
Subject Re: MultiXact\SLRU buffers configuration
Date
Msg-id 20201029134933.xd4mh2cofuf6tdfz@development
In response to Re: MultiXact\SLRU buffers configuration  (Andrey Borodin <x4mmm@yandex-team.ru>)
Responses Re: MultiXact\SLRU buffers configuration
List pgsql-hackers
On Thu, Oct 29, 2020 at 12:08:21PM +0500, Andrey Borodin wrote:
>
>
>> On 29 Oct 2020, at 04:32, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
>>
>> It's not my intention to be mean or anything like that, but to me this
>> means we don't really understand the problem we're trying to solve. Had
>> we understood it, we should be able to construct a workload reproducing
>> the issue ...
>>
>> I understand what the individual patches are doing, and maybe those
>> changes are desirable in general. But without any benchmarks from a
>> plausible workload I find it hard to convince myself that:
>>
>> (a) it actually will help with the issue you're observing on production
>>
>> and
>> (b) it's actually worth the extra complexity (e.g. the lwlock changes)
>>
>>
>> I'm willing to invest some of my time into reviewing/testing this, but I
>> think we badly need better insight into the issue, so that we can build
>> a workload reproducing it. Perhaps collecting some perf profiles and a
>> sample of the queries might help, but I assume you already tried that.
>
>Thanks, Tomas! This totally makes sense.
>
>Indeed, collecting queries did not help yet. We have a load-test environment equivalent to production (but with 10x
>fewer shards) and a copy of the production workload queries. But the problem does not manifest there.
>
>Why do I think the problem is in MultiXacts?
>Here is a chart with the number of wait events on each host:
>
>
>During the problem, MultiXactOffsetControlLock and SLRURead dominated all other lock types. After primary switchover
>to another node, SLRURead continued there for a bit, then disappeared.

OK, so most of this seems to be due to SLRURead and
MultiXactOffsetControlLock. Could it be that there were too many
multixact members, triggering autovacuum to prevent multixact
wraparound? That might generate a lot of IO on the SLRU. Are you
monitoring the size of the pg_multixact directory?
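
For example, something along these lines could show whether any database is getting close to the multixact freeze
threshold (just a sketch, to be adapted to your monitoring):

    -- distance of each database from the multixact freeze threshold
    SELECT datname,
           mxid_age(datminmxid) AS multixact_age
      FROM pg_database
     ORDER BY multixact_age DESC;

    -- freeze settings to compare against
    SHOW autovacuum_multixact_freeze_max_age;
    SHOW vacuum_multixact_freeze_table_age;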

>Backtraces on standbys during the problem show that most of the backends are sleeping in pg_sleep(1000L) and are not
>included in the wait stats on these charts.
>
>Currently I'm considering writing a test that directly calls MultiXactIdExpand(), MultiXactIdCreate(), and
>GetMultiXactIdMembers() from an extension. What do you think, would benchmarks in such tests be meaningful?
>

I don't know. I'd much rather have a SQL-level benchmark than an
extension doing this kind of stuff.
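
For example, a few concurrent sessions taking FOR KEY SHARE locks on the same rows will generate multixacts; a rough
sketch of such a SQL-level workload (table name and sizes are just placeholders):

    -- setup, run once
    CREATE TABLE mx_test (id int PRIMARY KEY, val int);
    INSERT INTO mx_test SELECT g, 0 FROM generate_series(1, 1000) g;

    -- run from many concurrent sessions (e.g. as a pgbench custom script);
    -- a row key-share locked by more than one transaction gets a multixact
    BEGIN;
    SELECT val FROM mx_test
     WHERE id = 1 + (random() * 999)::int
       FOR KEY SHARE;
    SELECT pg_sleep(0.01);   -- keep the xact open so locks from different sessions overlap
    COMMIT;

Something like that, run with a few dozen pgbench clients, should at least exercise the multixact offsets/members
SLRUs, even if it does not reproduce the exact production pattern.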


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 


