Re: MultiXact\SLRU buffers configuration - Mailing list pgsql-hackers

From Gilles Darold
Subject Re: MultiXact\SLRU buffers configuration
Msg-id 6ba7eae2-8b0c-0690-11a5-e921e6586180@darold.net
In response to Re: MultiXact\SLRU buffers configuration  (Andrey Borodin <x4mmm@yandex-team.ru>)
Responses Re: MultiXact\SLRU buffers configuration
List pgsql-hackers
On 13/11/2020 at 12:49, Andrey Borodin wrote:
>
>> On 10 Nov 2020, at 23:07, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
>>
>> On 11/10/20 7:16 AM, Andrey Borodin wrote:
>>>
>>> but this picture was not stable.
>>>
>> Seems we haven't made much progress in reproducing the issue :-( I guess
>> we'll need to know more about the machine where this happens. Is there
>> anything special about the hardware/config? Are you monitoring size of
>> the pg_multixact directory?
> It's Ubuntu 18.04.4 LTS, Intel Xeon E5-2660 v4, 56 CPU cores with 256Gb of RAM.
> PostgreSQL 10.14, compiled by gcc 7.5.0, 64-bit
>
> No, unfortunately we do not have signals for SLRU sizes.
> 3.5Tb mdadm raid10 over 28 SSD drives, 82% full.
>
> First incident triggering investigation was on 2020-04-19, at that time cluster was running on PG 10.11. But I think
> it was happening before.
>
> I'd say nothing special...
>
>>> How do you collect wait events for aggregation? just insert into some table with cron?
>>>
>> No, I have a simple shell script (attached) sampling data from
>> pg_stat_activity regularly. Then I load it into a table and aggregate to
>> get a summary.
> Thanks!
>
> Best regards, Andrey Borodin.


Hi,


Some time ago I encountered contention on
MultiXactOffsetControlLock during a performance benchmark. Here are the
wait event monitoring results, polling every 10 seconds over a
30-minute benchmark run:


  event_type |           event            |   sum
------------+----------------------------+----------
  Client     | ClientRead                 | 44722952
  LWLock     | MultiXactOffsetControlLock | 30343060
  LWLock     | multixact_offset           | 16735250
  LWLock     | MultiXactMemberControlLock |  1601470
  LWLock     | buffer_content             |   991344
  LWLock     | multixact_member           |   805624
  Lock       | transactionid              |   204997
  Activity   | LogicalLauncherMain        |   198834
  Activity   | CheckpointerMain           |   198834
  Activity   | AutoVacuumMain             |   198469
  Activity   | BgWriterMain               |   184066
  Activity   | WalWriterMain              |   171571
  LWLock     | WALWriteLock               |    72428
  IO         | DataFileRead               |    35708
  Activity   | BgWriterHibernate          |    12741
  IO         | SLRURead                   |     9121
  Lock       | relation                   |     8858
  LWLock     | ProcArrayLock              |     7309
  LWLock     | lock_manager               |     6677
  LWLock     | pg_stat_statements         |     4194
  LWLock     | buffer_mapping             |     3222
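
A minimal sketch of how a summary like the one above can be aggregated
from periodic samples. It assumes each polling round yields one
(wait_event_type, wait_event) pair per backend, for example from
"SELECT wait_event_type, wait_event FROM pg_stat_activity". The function
name and the sample data are hypothetical, and the exact definition of
the "sum" column above may differ from a plain sample count.

```python
from collections import Counter


def summarize_wait_events(samples):
    """Count how often each (event_type, event) pair was sampled.

    `samples` is an iterable of (event_type, event) tuples, one per
    backend per polling round. Backends that are not waiting (None)
    are skipped.
    """
    counts = Counter(
        (event_type, event)
        for event_type, event in samples
        if event_type is not None and event is not None
    )
    # Sort by descending count, like the table in this message.
    return sorted(counts.items(), key=lambda item: item[1], reverse=True)


# Hypothetical samples, not real benchmark data.
samples = [
    ("Client", "ClientRead"),
    ("LWLock", "MultiXactOffsetControlLock"),
    (None, None),
    ("Client", "ClientRead"),
    ("LWLock", "MultiXactOffsetControlLock"),
    ("LWLock", "multixact_offset"),
    ("Client", "ClientRead"),
]

for (event_type, event), total in summarize_wait_events(samples):
    print(f"{event_type:<10} | {event:<26} | {total:>8}")
```

In practice the samples would be loaded into a table and aggregated with
GROUP BY; the Python version only illustrates the counting step.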


After reading this thread I changed the buffer sizes to 32 and 64 and
obtained the following results:


  event_type |           event            |    sum
------------+----------------------------+-----------
  Client     | ClientRead                 | 268297572
  LWLock     | MultiXactMemberControlLock |  65162906
  LWLock     | multixact_member           |  33397714
  LWLock     | buffer_content             |   4737065
  Lock       | transactionid              |   2143750
  LWLock     | SubtransControlLock        |   1318230
  LWLock     | WALWriteLock               |   1038999
  Activity   | LogicalLauncherMain        |    940598
  Activity   | AutoVacuumMain             |    938566
  Activity   | CheckpointerMain           |    799052
  Activity   | WalWriterMain              |    749069
  LWLock     | subtrans                   |    710163
  Activity   | BgWriterHibernate          |    536763
  Lock       | object                     |    514225
  Activity   | BgWriterMain               |    394206
  LWLock     | lock_manager               |    295616
  IO         | DataFileRead               |    274236
  LWLock     | ProcArrayLock              |     77099
  Lock       | tuple                      |     59043
  IO         | CopyFileWrite              |     45611
  Lock       | relation                   |     42714

There was still contention on multixact, but less than in the first
run. I then increased the buffers to 128 and 512 and obtained the best
results for this benchmark:

  event_type |           event            |    sum
------------+----------------------------+-----------
  Client     | ClientRead                 | 160463037
  LWLock     | MultiXactMemberControlLock |   5334188
  LWLock     | buffer_content             |   5228256
  LWLock     | buffer_mapping             |   2368505
  LWLock     | SubtransControlLock        |   2289977
  IPC        | ProcArrayGroupUpdate       |   1560875
  LWLock     | ProcArrayLock              |   1437750
  Lock       | transactionid              |    825561
  LWLock     | subtrans                   |    772701
  LWLock     | WALWriteLock               |    666138
  Activity   | LogicalLauncherMain        |    492585
  Activity   | CheckpointerMain           |    492458
  Activity   | AutoVacuumMain             |    491548
  LWLock     | lock_manager               |    426531
  Lock       | object                     |    403581
  Activity   | WalWriterMain              |    394668
  Activity   | BgWriterHibernate          |    293112
  Activity   | BgWriterMain               |    195312
  LWLock     | MultiXactGenLock           |    177820
  LWLock     | pg_stat_statements         |    173864
  IO         | DataFileRead               |    173009


I hope these metrics are of some interest in showing the usefulness of
this patch, but unfortunately I cannot be more precise or provide
reports for the entire patch. The problem is that this benchmark runs on
an application that uses PostgreSQL 11, and I cannot backport the full
patch; there have been too many changes since PG11. I have only
increased the values of NUM_MXACTOFFSET_BUFFERS and
NUM_MXACTMEMBER_BUFFERS. This allowed us to triple the number of
simultaneous connections between the first and the last test.


I know that this report is not really helpful, but at least I can give
more information about the benchmark that was used. This is the
proprietary zRef benchmark, which compares the same COBOL programs
(transactional and batch) executed both on mainframes and on x86
servers. Instead of a DB2 z/OS database we use PostgreSQL v11. This test
makes extensive use of cursors (each SELECT, even read-only, is executed
through a cursor), and the contention was observed with UPDATEs on
tables with some foreign keys. There is no explicit FOR SHARE in the
queries, only some FOR UPDATE clauses. I guess that the multixact
contention is the result of the FOR SHARE locks produced for the FKs.


So in our case, being able to tune the multixact buffers could help a
lot to improve performance.


--
Gilles Darold



