Re: BufferAlloc: don't take two simultaneous locks - Mailing list pgsql-hackers

From Yura Sokolov
Subject Re: BufferAlloc: don't take two simultaneous locks
Date
Msg-id f03f37547d1606d008a4adc2748742ce590bcaaf.camel@postgrespro.ru
Whole thread Raw
In response to Re: BufferAlloc: don't take two simultaneous locks  (Kyotaro Horiguchi <horikyota.ntt@gmail.com>)
Responses Re: BufferAlloc: don't take two simultaneous locks  (Kyotaro Horiguchi <horikyota.ntt@gmail.com>)
List pgsql-hackers
В Ср, 16/03/2022 в 12:07 +0900, Kyotaro Horiguchi пишет:
> At Tue, 15 Mar 2022 13:47:17 +0300, Yura Sokolov <y.sokolov@postgrespro.ru> wrote in 
> > В Вт, 15/03/2022 в 16:25 +0900, Kyotaro Horiguchi пишет:
> > > Hmm. v8 returns stashed element with original patition index when the
> > > element is *not* reused.  But what I saw in the previous test runs is
> > > the REUSE->ENTER(reuse)(->REMOVE) case.  So the new version looks like
> > > behaving the same way (or somehow even worse) with the previous
> > > version.
> > 
> > v8 doesn't differ in REMOVE case neither from master nor from
> > previous version. It differs in RETURNED case only.
> > Or I didn't understand what you mean :(
> 
> In v7, HASH_ENTER returns the element stored in DynaHashReuse using
> the freelist_idx of the new key.  v8 uses that of the old key (at the
> time of HASH_REUSE).  So in the case "REUSE->ENTER(elem exists and
> returns the stashed)" case the stashed element is returned to its
> original partition.  But it is not what I mentioned.
> 
> On the other hand, once the stahsed element is reused by HASH_ENTER,
> it gives the same resulting state with HASH_REMOVE->HASH_ENTER(borrow
> from old partition) case.  I suspect that ththat the frequent freelist
> starvation comes from the latter case.

Doubtfully. Due to probabilty theory, single partition doubdfully
will be too overflowed. Therefore, freelist.

But! With 128kb shared buffers there is just 32 buffers. 32 entry for
32 freelist partition - certainly some freelist partition will certainly
have 0 entry even if all entries are in freelists. 

> > > get_hash_entry continuously suffer lack of freelist
> > > entry. (FWIW, attached are the test-output diff for both master and
> > > patched)
> > > 
> > > master finally allocated 31 fresh elements for a 100s run.
> > > 
> > > > ALLOCED: 31        ;; freshly allocated
> > > 
> > > v8 finally borrowed 33620 times from another freelist and 0 freshly
> > > allocated (ah, this version changes that..)
> > > Finally v8 results in:
> > > 
> > > > RETURNED: 50806    ;; returned stashed elements
> > > > BORROWED: 33620    ;; borrowed from another freelist
> > > > REUSED: 1812664    ;; stashed
> > > > ASSIGNED: 1762377  ;; reused
> > > >(ALLOCED: 0)        ;; freshly allocated
> 
> (I misunderstand that v8 modified get_hash_entry's preference between
> allocation and borrowing.)
> 
> I re-ran the same check for v7 and it showed different result.
> 
> RETURNED: 1
> ALLOCED: 15
> BORROWED: 0
> REUSED: 505435
> ASSIGNED: 505462 (-27)  ## the counters are not locked.
> 
> > Is there any measurable performance hit cause of borrowing?
> > Looks like "borrowed" happened in 1.5% of time. And it is on 128kb
> > shared buffers that is extremely small. (Or it was 128MB?)
> 
> It is intentional set small to get extremely frequent buffer
> replacements.  The point here was the patch actually can induce
> frequent freelist starvation.  And as you do, I also doubt the
> significance of the performance hit by that.  Just I was not usre.
>
> I re-ran the same for v8 and got a result largely different from the
> previous trial on the same v8.
> 
> RETURNED: 2
> ALLOCED: 0
> BORROWED: 435
> REUSED: 495444
> ASSIGNED: 495467 (-23)
> 
> Now "BORROWED" happens 0.8% of REUSED

0.08% actually :)

> 
> > Well, I think some spare entries could reduce borrowing if there is
> > a need. I'll test on 128MB with spare entries. If there is profit,
> > I'll return some, but will keep SharedBufHash fixed.
> 
> I don't doubt the benefit of this patch.  And now convinced by myself
> that the downside is negligible than the benefit.
> 
> > Master branch does less freelist manipulations since it  tries to
> > insert first and if there is collision it doesn't delete victim
> > buffer.
> > 
> > > > I lost access to Xeon 8354H, so returned to old Xeon X5675.
> > > ...
> > > > Strange thing: both master and patched version has higher
> > > > peak tps at X5676 at medium connections (17 or 27 clients)
> > > > than in first october version [1]. But lower tps at higher
> > > > connections number (>= 191 clients).
> > > > I'll try to bisect on master this unfortunate change.
> > > 
> > > The reversing of the preference order between freshly-allocation and
> > > borrow-from-another-freelist might affect.
> > 
> > `master` changed its behaviour as well.
> > It is not problem of the patch at all.
> 
> Agreed.  So I think we should go on this direction.

I've checked. Looks like something had changed on the server, since
old master commit behaves now same to new one (and differently to
how it behaved in October).
I remember maintainance downtime of the server in november/december.
Probably, kernel were upgraded or some system settings were changed.

> There are some last comments on v8.
> 
> +                                                                 HASH_FIXED_SIZE);
> 
> Ah, now I understand that this prevented allocation of new elements.
> I think this good to do for SharedBufHash.
> 
> 
> ====
> +       long            nfree;                  /* number of free entries in the list */
>         HASHELEMENT *freeList;          /* chain of free elements */
>  } FreeListData;
>  
> +#if SIZEOF_LONG == 4
> +typedef pg_atomic_uint32 nalloced_store_t;
> +typedef uint32 nalloced_value_t;
> +#define nalloced_read(a)       (long)pg_atomic_read_u32(a)
> +#define nalloced_add(a, v)     pg_atomic_fetch_add_u32((a), (uint32)(v))
> ====
> 
> I don't think nalloced needs to be the same width to long.  For the
> platforms with 32-bit long, anyway the possible degradation if any by
> 64-bit atomic there doesn't matter.  So don't we always define the
> atomic as 64bit and use the pg_atomic_* functions directly?

Some 32bit platforms has no native 64bit atomics. Then they are
emulated with locks.

Native atomic read/write is quite cheap. So I don't bother with
unlocked read/write for non-partitioned table. (And I don't know
which platform has sizeof(long)>4 without having native 64bit
atomic as well)

(May be I'm wrong a bit? element_alloc invokes nalloc_add, which
is atomic increment. Could it be expensive enough to be problem
in non-shared dynahash instances?)

If patch stick with pg_atomic_uint64 for nalloced, then it have
to separate read+write for partitioned(actually shared) and
non-partitioned cases.

Well, and for 32bit platform long is just enough. Why spend other
4 bytes per each dynahash?

By the way, there is unfortunate miss of PG_HAVE_8BYTE_SINGLE_COPY_ATOMICITY
in port/atomics/arch-arm.h for aarch64 . I'll send patch for
in new thread.

> +               case HASH_REUSE:
> +                       if (currBucket != NULL)
> 
> Don't we need an assertion on (DunaHashReuse.element == NULL) here?

Common assert is higher on line 1094:

    Assert(action == HASH_ENTER || DynaHashReuse.element == NULL);

I thought it is more accurate than duplicated in each switch case.

> -       size = add_size(size, BufTableShmemSize(NBuffers + NUM_BUFFER_PARTITIONS));
> +       /* size of lookup hash table */
> +       size = add_size(size, BufTableShmemSize(NBuffers));
> 
> I was not sure that this is safe, but actually I didn't get "out of
> shared memory". On second thought, I realized that when a dynahash
> entry is stashed, BufferAlloc always holding a buffer block, too.
> So now I'm sure that this is safe.
> 
> 
> That's all.

Thanks you very much for productive review and discussion.



regards,

Yura Sokolov
Postgres Professional
y.sokolov@postgrespro.ru
funny.falcon@gmail.com




pgsql-hackers by date:

Previous
From: Daniel Gustafsson
Date:
Subject: Re: Out-of-tree certificate interferes ssltest
Next
From: Amit Kapila
Date:
Subject: Re: Column Filtering in Logical Replication