Re: MultiXact\SLRU buffers configuration - Mailing list pgsql-hackers
From | Andrey Borodin |
---|---|
Subject | Re: MultiXact\SLRU buffers configuration |
Date | |
Msg-id | 65C1B4BA-D16F-4939-978B-AC8F370F5A5E@yandex-team.ru |
In response to | Re: MultiXact\SLRU buffers configuration (Tomas Vondra <tomas.vondra@2ndquadrant.com>) |
Responses | Re: MultiXact\SLRU buffers configuration |
List | pgsql-hackers |
> On 29 Oct 2020, at 18:49, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
>
> On Thu, Oct 29, 2020 at 12:08:21PM +0500, Andrey Borodin wrote:
>>
>>
>>> On 29 Oct 2020, at 04:32, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
>>>
>>> It's not my intention to be mean or anything like that, but to me this
>>> means we don't really understand the problem we're trying to solve. Had
>>> we understood it, we should be able to construct a workload reproducing
>>> the issue ...
>>>
>>> I understand what the individual patches are doing, and maybe those
>>> changes are desirable in general. But without any benchmarks from a
>>> plausible workload I find it hard to convince myself that:
>>>
>>> (a) it actually will help with the issue you're observing on production
>>>
>>> and
>>> (b) it's actually worth the extra complexity (e.g. the lwlock changes)
>>>
>>> I'm willing to invest some of my time into reviewing/testing this, but I
>>> think we badly need better insight into the issue, so that we can build
>>> a workload reproducing it. Perhaps collecting some perf profiles and a
>>> sample of the queries might help, but I assume you already tried that.
>>
>> Thanks, Tomas! This totally makes sense.
>>
>> Indeed, collecting queries has not helped yet. We have a load-test environment equivalent to production (but with 10x fewer shards) and a copy of the production workload queries, but the problem does not manifest there.
>> Why do I think the problem is in MultiXacts?
>> Here is a chart with the number of wait events on each host.
>>
>> During the problem, MultiXactOffsetControlLock and SLRURead dominate all other lock types. After primary switchover to another node, SLRURead continued for a bit there, then disappeared.
>
> OK, so most of this seems to be due to SLRURead and
> MultiXactOffsetControlLock. Could it be that there were too many
> multixact members, triggering autovacuum to prevent multixact
> wraparound? That might generate a lot of IO on the SLRU. Are you
> monitoring the size of the pg_multixact directory?

Yes, we had some problems with 'multixact "members" limit exceeded' a long time ago. We tuned autovacuum_multixact_freeze_max_age = 200000000 and vacuum_multixact_freeze_table_age = 75000000 (half of the defaults), and since then we have not encountered this problem again (~5 months).

But the MultiXactOffsetControlLock problem persists. The problem was partially solved by adding more shards, but when one of the shards encounters a problem, it is either MultiXacts or vacuum causing relation truncation (unrelated to this thread).

Thanks!

Best regards, Andrey Borodin.
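For readers trying to reproduce the observation above: the wait-event distribution can be sampled directly from pg_stat_activity, and the pg_multixact footprint and the freeze settings quoted in the mail can be inspected from SQL. A minimal sketch, assuming a role with superuser/file-access privileges and a pre-13 server (where the LWLock is still reported as MultiXactOffsetControlLock):

    -- Sample wait events across all backends; on an affected host one would
    -- expect LWLock/MultiXactOffsetControlLock and IO/SLRURead near the top.
    SELECT wait_event_type, wait_event, count(*)
      FROM pg_stat_activity
     WHERE wait_event IS NOT NULL
     GROUP BY wait_event_type, wait_event
     ORDER BY count(*) DESC;

    -- Rough on-disk size of the multixact members SLRU, i.e. the directory
    -- asked about above.
    SELECT pg_size_pretty(sum((pg_stat_file('pg_multixact/members/' || f)).size))
      FROM pg_ls_dir('pg_multixact/members') AS f;

    -- The freeze settings quoted in the mail (half of the 400000000 / 150000000
    -- defaults); note that autovacuum_multixact_freeze_max_age only takes
    -- effect after a server restart.
    ALTER SYSTEM SET autovacuum_multixact_freeze_max_age = 200000000;
    ALTER SYSTEM SET vacuum_multixact_freeze_table_age = 75000000;

Sampling the first query periodically on each host would approximate the per-host wait-event chart referred to in the message.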