Re: Improve eviction algorithm in ReorderBuffer - Mailing list pgsql-hackers

From Masahiko Sawada
Subject Re: Improve eviction algorithm in ReorderBuffer
Date
Msg-id CAD21AoAtf12e9Z9NLBuaO1GjHMMo16_8R-yBu9Q9jrk2QLqMEA@mail.gmail.com
Whole thread Raw
In response to Re: Improve eviction algorithm in ReorderBuffer  (Jeff Davis <pgsql@j-davis.com>)
Responses Re: Improve eviction algorithm in ReorderBuffer
List pgsql-hackers
On Sat, Apr 6, 2024 at 5:44 AM Jeff Davis <pgsql@j-davis.com> wrote:
>
> >
> > It sounds like a data structure that mixes the hash table and the
> > binary heap and we use it as the main storage (e.g. for
> > ReorderBufferTXN) instead of using the binary heap as the secondary
> > data structure. IIUC with your idea, the indexed binary heap has a
> > hash table to store elements each of which has its index within the
> > heap node array. I guess it's better to create it as a new data
> > structure rather than extending the existing binaryheap, since APIs
> > could be very different. I might be missing something, though.
>
> You are right that this approach starts to feel like a new data
> structure and is not v17 material.
>
> I am interested in this for v18 though -- we could make the API more
> like simplehash to be more flexible when using values (rather than
> pointers) and to be able to inline the comparator.

Interesting project. It would be great if we could support increasing
and decreasing the key as APIs. The current
binaryheap_update_{up|down} APIs are not very user-friendly.

>
> > > * remove the hash table from binaryheap.c
> > >
> > > * supply a new callback to the binary heap with type like:
> > >
> > >   typedef void (*binaryheap_update_index)(
> > >     bh_node_type node,
> > >     int new_element_index);
> > >
> > > * make the remove, update_up, and update_down methods take the
> > > element
> > > index rather than the pointer
>
> ...
>
> > This shows that the current indexed binaryheap is much slower than
> > the
> > other two implementations, and the xx_binaryheap has a good number in
> > spite of also being indexed.
>
> xx_binaryheap isn't indexed though, right?

Well, yes. To be xact, xx_binaryheap isn't indexed but the element
indexes are stored in the element itself (see test_elem struct) so the
caller still can update the key using xx_binaryheap_update_{up|down}.

>
> > When it comes to
> > implementing the above idea (i.e. changing binaryheap to
> > xx_binaryheap), it was simple since we just replace the code where we
> > update the hash table with the code where we call the callback, if we
> > get the consensus on API change.
>
> That seems reasonable to me.
>
> > The fact that we use simplehash for the internal hash table might
> > make
> > this idea complex. If I understand your suggestion correctly, the
> > caller needs to tell the hash table the hash function when creating a
> > binaryheap but the hash function needs to be specified at a compile
> > time. We can use a dynahash instead but it would make the binaryheap
> > slow further.
>
> simplehash.h supports private_data, which makes it easier to track a
> callback.
>
> In binaryheap.c, that would look something like:
>
>   static inline uint32
>   binaryheap_hash(bh_nodeidx_hash *tab, uint32 key)
>   {
>      binaryheap_hashfunc hashfunc = tab->private_data;
>      return hashfunc(key);
>   }
>
>   ...
>   #define SH_HASH_KEY(tb, key) binaryheap_hash(tb, key)
>   ...
>
>   binaryheap_allocate(int num_nodes, binaryheap_comparator compare,
>                       void *arg, binaryheap_hashfunc hashfunc)
>   {
>     ...
>     if (hashfunc != NULL)
>     {
>        /* could have a new structure, but we only need to
>         * store one pointer, so don't bother with palloc/pfree */
>        void *private_data = (void *)hashfunc;
>        heap->bh_nodeidx = bh_nodeidx_create(..., private_data);
>        ...
>
>
> And in reorderbuffer.c, define the callback like:
>
>   static uint32
>   reorderbuffer_xid_hash(TransactionId xid)
>   {
>     /* fasthash32 is 'static inline' so may
>      * be faster than hash_bytes()? */
>     return fasthash32(&xid, sizeof(TransactionId), 0);
>   }
>

Thanks, that's a good idea.

>
>
> In summary, there are two viable approaches for addressing the concerns
> in v17:
>
> 1. Provide callback to update ReorderBufferTXN->heap_element_index, and
> use that index (rather than the pointer) for updating the heap key
> (transaction size) or removing elements from the heap.
>
> 2. Provide callback for hashing, so that binaryheap.c can hash the xid
> value rather than the pointer.
>
> I don't have a strong opinion about which one to use. I prefer
> something closer to #1 for v18, but for v17 I suggest whichever one
> comes out simpler.

I've implemented prototypes of both ideas, and attached the draft patches.

I agree with you that something closer to #1 is for v18. Probably we
can implement the #1 idea while making binaryheap codes template like
simplehash.h. For v17, changes for #2 are smaller, but I'm concerned
that the new API that requires a hash function to be able to use
binaryheap_update_{up|down} might not be user friendly. In terms of
APIs, I prefer #1 idea. And changes for #1 can make the binaryheap
code simple, although it requires adding a variable in
ReorderBufferTXN instead. But overall, it can remove the hash table
and some functions so it looks better to me.

When it comes to performance overhead, I mentioned that there is some
regression in the current binaryheap even without indexing. Since
function calling contributed to the regression, inlining some
functions reduced some overheads. For example, inlining set_node() and
replace_node(), the same benchmark test I used showed:

postgres(1:88476)=# select * from generate_series(1,3) x(x), lateral
(select * from bench_load(false, 10000000 * (1+x-x)));
 x |   cnt    | load_ms | xx_load_ms | old_load_ms
---+----------+---------+------------+-------------
 1 | 10000000 |     502 |        624 |         427
 2 | 10000000 |     503 |        622 |         428
 3 | 10000000 |     502 |        621 |         427
(3 rows)

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

Attachment

pgsql-hackers by date:

Previous
From: Jelte Fennema-Nio
Date:
Subject: Re: Flushing large data immediately in pqcomm
Next
From: Pavel Borisov
Date:
Subject: Re: [PoC] Improve dead tuple storage for lazy vacuum