Re: BufferAlloc: don't take two simultaneous locks - Mailing list pgsql-hackers

From Yura Sokolov
Subject Re: BufferAlloc: don't take two simultaneous locks
Date
Msg-id 91d6d0f9eae224d690833bc859753f79e36f2f3a.camel@postgrespro.ru
In response to Re: BufferAlloc: don't take two simultaneous locks  (Yura Sokolov <y.sokolov@postgrespro.ru>)
Responses Re: BufferAlloc: don't take two simultaneous locks
List pgsql-hackers
On Mon, 14/03/2022 at 14:57 +0300, Yura Sokolov wrote:
> On Mon, 14/03/2022 at 17:12 +0900, Kyotaro Horiguchi wrote:
> > At Mon, 14 Mar 2022 09:15:11 +0300, Yura Sokolov <y.sokolov@postgrespro.ru> wrote in 
> > > On Mon, 14/03/2022 at 14:31 +0900, Kyotaro Horiguchi wrote:
> > > > I'd like to ask you to remove nalloced from partitions then add a
> > > > global atomic for the same use?
> > > 
> > > I really believe it should be global. I made it per-partition to
> > > avoid overcomplicating the first versions. Glad you mention it.
> > > 
> > > I thought of protecting it with freeList[0].mutex, but an atomic
> > > is probably a better idea here. But which atomic to choose: uint64
> > > or uint32? Based on sizeof(long)?
> > > Ok, I'll do it in the next version.
> > 
> > Current nentries is a long (= int64 on CentOS). And uint32 can support
> > roughly 2^32 * 8192 = 32TB shared buffers, which doesn't seem safe
> > enough.  So it would be uint64.
> > 
> > > The whole of get_hash_entry looks strange.
> > > Wouldn't it be better to cycle through the partitions' free lists
> > > and only then fall back to allocating a new element?
> > > Maybe there should be a bitmap of non-empty free lists? 32 bits for
> > > 32 partitions. But wouldn't the bitmap become a contention point
> > > itself?
> > 
> > The code puts significance on avoiding contention caused by visiting
> > freelists of other partitions, and perhaps assumes that freelist
> > shortage rarely happens.
> > 
> > I tried pgbench runs with scale 100 (with 10 threads, 10 clients) on
> > 128kB shared buffers and saw that get_hash_entry never takes the
> > !element_alloc() path and always allocates a fresh entry, then
> > saturates at 30 new elements allocated around the middle of a
> > 100-second run.
> > 
> > Then I tried the same with the patch, and I was surprised to see that
> > the rise in the number of newly allocated elements didn't stop and
> > went up to 511 elements after the 100-second run.  So I found that my
> > concern was valid.  The change in dynahash actually
> > continuously/repeatedly causes a lack of free list entries.  I'm not
> > sure how much impact it would have on performance if we changed
> > get_hash_entry to prefer other freelists, though.
> 
> Well, it is quite strange that SharedBufHash is not allocated as
> HASH_FIXED_SIZE. Could you check what happens with this flag set?
> I'll try as well.
> 
> Another way to reduce the observed effect is to remember the
> freelist_idx of a reused entry. I didn't believe it mattered much,
> since entries migrate nevertheless, but probably due to some hot
> buffers there is a tendency to crowd a particular freelist.

Well, I did both. Everything looks ok.
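
And for the global allocation counter agreed up-thread, the shape of
that change is roughly this (a simplified sketch, not the attached v8;
the helper names below are illustrative):

#include "postgres.h"
#include "port/atomics.h"

/*
 * Sketch: dynahash keeps a single shared counter of allocated entries
 * instead of one per freelist.  uint64, since uint32 would cap out
 * around 32TB of shared buffers.
 */
typedef struct HASHHDR
{
	/* ... all existing header fields stay unchanged ... */
	pg_atomic_uint64 nalloced;	/* total number of allocated entries */
} HASHHDR;

/* at hash table creation */
static void
nalloced_init(HASHHDR *hctl)
{
	pg_atomic_init_u64(&hctl->nalloced, 0);
}

/* wherever element_alloc() hands out nelem fresh elements */
static void
nalloced_add(HASHHDR *hctl, int64 nelem)
{
	pg_atomic_fetch_add_u64(&hctl->nalloced, nelem);
}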

> > By the way, there's the following comment in StrategyInitialize.
> > 
> > >        * Initialize the shared buffer lookup hashtable.
> > >        *
> > >        * Since we can't tolerate running out of lookup table entries, we must be
> > >        * sure to specify an adequate table size here.  The maximum steady-state
> > >        * usage is of course NBuffers entries, but BufferAlloc() tries to insert
> > >        * a new entry before deleting the old.  In principle this could be
> > >        * happening in each partition concurrently, so we could need as many as
> > >        * NBuffers + NUM_BUFFER_PARTITIONS entries.
> > >        */
> > >       InitBufTable(NBuffers + NUM_BUFFER_PARTITIONS);
> > 
> > "but BufferAlloc() tries to insert a new entry before deleting the
> > old." gets false by this patch but still need that additional room for
> > stashed entries.  It seems like needing a fix.

I removed that whole comment paragraph, because a fixed-size table
without the extra entries works just fine; roughly, it amounts to the
sketch below.
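
A rough sketch of the idea against current freelist.c and buf_table.c
(placement and names are from memory; the attached patch is the
authoritative version):

	/* StrategyInitialize(): the patch no longer inserts the new entry
	 * before deleting the old one, so no extra room per partition is
	 * needed. */
	InitBufTable(NBuffers);

	/* InitBufTable(): additionally mark the table as fixed-size, so
	 * dynahash never allocates beyond the initial entries. */
	SharedBufHash = ShmemInitHash("Shared Buffer Lookup Table",
								  size, size,
								  &info,
								  HASH_ELEM | HASH_BLOBS |
								  HASH_PARTITION | HASH_FIXED_SIZE);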

I lost access to the Xeon 8354H, so I returned to the old Xeon X5675.

128MB and 1GB shared buffers
pgbench with scale 100
select-only benchmark, unix sockets
(the "1G" columns below are the 1GB runs, the others 128MB; numbers are tps)
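
The runs are plain pgbench along these lines (duration, thread count and
database name here are illustrative placeholders, not necessarily the
exact values used):

	pgbench -i -s 100 bench
	pgbench -n -S -c $CONNS -j $CONNS -T 30 bench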

Notebook i7-1165G7:


  conns |     master |         v8 |  master 1G |      v8 1G 
--------+------------+------------+------------+------------
      1 |      29614 |      29285 |      32413 |      32784 
      2 |      58541 |      60052 |      65851 |      65938 
      3 |      91126 |      90185 |     101404 |     101956 
      5 |     135809 |     133670 |     143783 |     143471 
      7 |     155547 |     153568 |     162566 |     162361 
     17 |     221794 |     218143 |     250562 |     250136 
     27 |     213742 |     211226 |     241806 |     242594 
     53 |     216067 |     214792 |     245868 |     246269 
     83 |     216610 |     218261 |     246798 |     250515 
    107 |     216169 |     216656 |     248424 |     250105 
    139 |     208892 |     215054 |     244630 |     246439 
    163 |     206988 |     212751 |     244061 |     248051 
    191 |     203842 |     214764 |     241793 |     245081 
    211 |     201304 |     213997 |     240863 |     246076 
    239 |     199313 |     211713 |     239639 |     243586 
    271 |     196712 |     211849 |     236231 |     243831 
    307 |     194879 |     209813 |     233811 |     241303 
    353 |     191279 |     210145 |     230896 |     241039 
    397 |     188509 |     207480 |     227812 |     240637 

X5675 1 socket:

  conns |     master |         v8 |  master 1G |      v8 1G 
--------+------------+------------+------------+------------
      1 |      18590 |      18473 |      19652 |      19051 
      2 |      34899 |      34799 |      37242 |      37432 
      3 |      51484 |      51393 |      54750 |      54398 
      5 |      71037 |      70564 |      76482 |      75985 
      7 |      87391 |      86937 |      96185 |      95433 
     17 |     122609 |     123087 |     140578 |     140325 
     27 |     120051 |     120508 |     136318 |     136343 
     53 |     116851 |     117601 |     133338 |     133265 
     83 |     113682 |     116755 |     131841 |     132736 
    107 |     111925 |     116003 |     130661 |     132386 
    139 |     109338 |     115011 |     128319 |     131453 
    163 |     107661 |     114398 |     126684 |     130677 
    191 |     105000 |     113745 |     124850 |     129909 
    211 |     103607 |     113347 |     123469 |     129302 
    239 |     101820 |     112428 |     121752 |     128621 
    271 |     100060 |     111863 |     119743 |     127624 
    307 |      98554 |     111270 |     117650 |     126877 
    353 |      97530 |     110231 |     115904 |     125351 
    397 |      96122 |     109471 |     113609 |     124150 

X5675 2 socket:

  conns |     master |         v8 |  master 1G |      v8 1G 
--------+------------+------------+------------+------------
      1 |      17815 |      17577 |      19321 |      19187 
      2 |      34312 |      35655 |      37121 |      36479 
      3 |      51868 |      52165 |      56048 |      54984 
      5 |      81704 |      82477 |      90945 |      90109 
      7 |     107937 |     105411 |     116015 |     115810 
     17 |     191339 |     190813 |     216899 |     215775 
     27 |     236541 |     238078 |     278507 |     278073 
     53 |     230323 |     231709 |     267226 |     267449 
     83 |     225560 |     227455 |     261996 |     262344 
    107 |     221317 |     224030 |     259694 |     259553 
    139 |     206945 |     219005 |     254817 |     256736 
    163 |     197723 |     220353 |     251631 |     257305 
    191 |     193243 |     219149 |     246960 |     256528 
    211 |     189603 |     218545 |     245362 |     255785 
    239 |     186382 |     217229 |     240006 |     255024 
    271 |     183141 |     216359 |     236927 |     253069 
    307 |     179275 |     215218 |     232571 |     252375 
    353 |     175559 |     213298 |     227244 |     250534 
    397 |     172916 |     211627 |     223513 |     248919 

Strange thing: both master and the patched version have a higher
peak tps on the X5675 at medium connection counts (17 or 27 clients)
than in the first October version [1], but lower tps at higher
connection counts (>= 191 clients).
I'll try to bisect this unfortunate change on master.

The October master was 2d44dee0281a1abf and today's is 7e12256b478b895.

(There is a small possibility that I tested with TCP sockets in October
and with UNIX sockets today, and that this made the difference.)

[1] https://postgr.es/m/1edbb61981fe1d99c3f20e3d56d6c88999f4227c.camel%40postgrespro.ru

-------

regards
Yura Sokolov
Postgres Professional
y.sokolov@postgrespro.ru



Attachment
