Re: MultiXact\SLRU buffers configuration - Mailing list pgsql-hackers
From | Gilles Darold
---|---
Subject | Re: MultiXact\SLRU buffers configuration
Date |
Msg-id | 6ba7eae2-8b0c-0690-11a5-e921e6586180@darold.net
In response to | Re: MultiXact\SLRU buffers configuration (Andrey Borodin <x4mmm@yandex-team.ru>)
Responses | Re: MultiXact\SLRU buffers configuration
List | pgsql-hackers
On 13/11/2020 at 12:49, Andrey Borodin wrote:
>
>> On 10 Nov 2020, at 23:07, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
>>
>> On 11/10/20 7:16 AM, Andrey Borodin wrote:
>>>
>>> but this picture was not stable.
>>>
>> Seems we haven't made much progress in reproducing the issue :-( I guess
>> we'll need to know more about the machine where this happens. Is there
>> anything special about the hardware/config? Are you monitoring size of
>> the pg_multixact directory?
> It's Ubuntu 18.04.4 LTS, Intel Xeon E5-2660 v4, 56 CPU cores with 256Gb of RAM.
> PostgreSQL 10.14, compiled by gcc 7.5.0, 64-bit
>
> No, unfortunately we do not have signals for SLRU sizes.
> 3.5Tb mdadm raid10 over 28 SSD drives, 82% full.
>
> First incident triggering investigation was on 2020-04-19, at that time the cluster was running on PG 10.11. But I think it was happening before.
>
> I'd say nothing special...
>
>>> How do you collect wait events for aggregation? just insert into some table with cron?
>>>
>> No, I have a simple shell script (attached) sampling data from
>> pg_stat_activity regularly. Then I load it into a table and aggregate to
>> get a summary.
> Thanks!
>
> Best regards, Andrey Borodin.

Hi,

Some time ago I encountered contention on MultiXactOffsetControlLock during a performance benchmark. Here are the wait event monitoring results, sampled every 10 seconds over a 30 minute run of the benchmark:

 event_type |           event            |   sum
------------+----------------------------+----------
 Client     | ClientRead                 | 44722952
 LWLock     | MultiXactOffsetControlLock | 30343060
 LWLock     | multixact_offset           | 16735250
 LWLock     | MultiXactMemberControlLock |  1601470
 LWLock     | buffer_content             |   991344
 LWLock     | multixact_member           |   805624
 Lock       | transactionid              |   204997
 Activity   | LogicalLauncherMain        |   198834
 Activity   | CheckpointerMain           |   198834
 Activity   | AutoVacuumMain             |   198469
 Activity   | BgWriterMain               |   184066
 Activity   | WalWriterMain              |   171571
 LWLock     | WALWriteLock               |    72428
 IO         | DataFileRead               |    35708
 Activity   | BgWriterHibernate          |    12741
 IO         | SLRURead                   |     9121
 Lock       | relation                   |     8858
 LWLock     | ProcArrayLock              |     7309
 LWLock     | lock_manager               |     6677
 LWLock     | pg_stat_statements         |     4194
 LWLock     | buffer_mapping             |     3222

After reading this thread I changed the values of the buffer sizes to 32 and 64 and obtained the following results:

 event_type |           event            |    sum
------------+----------------------------+-----------
 Client     | ClientRead                 | 268297572
 LWLock     | MultiXactMemberControlLock |  65162906
 LWLock     | multixact_member           |  33397714
 LWLock     | buffer_content             |   4737065
 Lock       | transactionid              |   2143750
 LWLock     | SubtransControlLock        |   1318230
 LWLock     | WALWriteLock               |   1038999
 Activity   | LogicalLauncherMain        |    940598
 Activity   | AutoVacuumMain             |    938566
 Activity   | CheckpointerMain           |    799052
 Activity   | WalWriterMain              |    749069
 LWLock     | subtrans                   |    710163
 Activity   | BgWriterHibernate          |    536763
 Lock       | object                     |    514225
 Activity   | BgWriterMain               |    394206
 LWLock     | lock_manager               |    295616
 IO         | DataFileRead               |    274236
 LWLock     | ProcArrayLock              |     77099
 Lock       | tuple                      |     59043
 IO         | CopyFileWrite              |     45611
 Lock       | relation                   |     42714

There was still contention on multixact, but less than in the first run.
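For reference, summaries like the two above can be produced with a simple aggregation over sampled pg_stat_activity data. This is only a minimal sketch, assuming the 10-second samples are loaded into a table named wait_sample with the raw pg_stat_activity columns; the table name and loading step are illustrative, not the exact script mentioned in the quoted mail:

    -- Aggregate sampled wait events into the event_type | event | sum layout
    -- shown above (wait_sample is a hypothetical table of raw samples).
    SELECT wait_event_type AS event_type,
           wait_event      AS event,
           count(*)        AS sum
      FROM wait_sample
     WHERE wait_event IS NOT NULL
     GROUP BY wait_event_type, wait_event
     ORDER BY count(*) DESC;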
I then increased the buffers to 128 and 512 and obtained the best results for this benchmark:

 event_type |           event            |    sum
------------+----------------------------+-----------
 Client     | ClientRead                 | 160463037
 LWLock     | MultiXactMemberControlLock |   5334188
 LWLock     | buffer_content             |   5228256
 LWLock     | buffer_mapping             |   2368505
 LWLock     | SubtransControlLock        |   2289977
 IPC        | ProcArrayGroupUpdate       |   1560875
 LWLock     | ProcArrayLock              |   1437750
 Lock       | transactionid              |    825561
 LWLock     | subtrans                   |    772701
 LWLock     | WALWriteLock               |    666138
 Activity   | LogicalLauncherMain        |    492585
 Activity   | CheckpointerMain           |    492458
 Activity   | AutoVacuumMain             |    491548
 LWLock     | lock_manager               |    426531
 Lock       | object                     |    403581
 Activity   | WalWriterMain              |    394668
 Activity   | BgWriterHibernate          |    293112
 Activity   | BgWriterMain               |    195312
 LWLock     | MultiXactGenLock           |    177820
 LWLock     | pg_stat_statements         |    173864
 IO         | DataFileRead               |    173009

I hope these metrics are of some interest in showing the utility of this patch, but unfortunately I cannot be more precise or provide reports for the entire patch. The problem is that this benchmark runs on an application that uses PostgreSQL 11 and I cannot back-port the full patch; there have been too many changes since PG11. I have just increased the values of NUM_MXACTOFFSET_BUFFERS and NUM_MXACTMEMBER_BUFFERS. This allowed us to triple the number of simultaneous connections between the first and the last test.

I know that this report is not really helpful, but at least I can give more information on the benchmark that was used. It is the proprietary zRef benchmark, which compares the same COBOL programs (transactional and batch) executed both on mainframes and on x86 servers. Instead of a DB2 z/OS database we use PostgreSQL v11. The test makes extensive use of cursors (every SELECT, even read-only ones, is executed through a cursor), and the contention was observed with updates on tables that have some foreign keys. There is no explicit FOR SHARE in the queries, only some FOR UPDATE clauses. I guess that the multixact contention is the result of the FOR SHARE locks taken for the foreign keys (a minimal illustration is sketched below). So in our case, being able to tune the multixact buffers could help a lot to improve performance.

--
Gilles Darold
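As a small illustration of the foreign-key guess above (table names are invented, not taken from the zRef benchmark): the RI trigger behind each INSERT locks the referenced parent row with SELECT ... FOR KEY SHARE, so two concurrent sessions are enough to turn that row's lock into a multixact even though the application never writes FOR SHARE:

    -- Hypothetical schema for illustration only.
    CREATE TABLE parent (id int PRIMARY KEY);
    CREATE TABLE child  (id int PRIMARY KEY,
                         parent_id int REFERENCES parent (id));
    INSERT INTO parent VALUES (1);

    -- Session A: the FK check takes a FOR KEY SHARE lock on parent row 1.
    BEGIN;
    INSERT INTO child VALUES (100, 1);

    -- Session B (concurrently): a second key-share lock on the same row,
    -- so its xmax is replaced by a newly created multixact.
    BEGIN;
    INSERT INTO child VALUES (101, 1);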