Re: Improve eviction algorithm in ReorderBuffer - Mailing list pgsql-hackers

From Masahiko Sawada
Subject Re: Improve eviction algorithm in ReorderBuffer
Msg-id CAD21AoAtR2A4f1-JFUE9KRkiPqMRvxS+3W=ZR7FpuXm8VSVASQ@mail.gmail.com
In response to Re: Improve eviction algorithm in ReorderBuffer  (Jeff Davis <pgsql@j-davis.com>)
List pgsql-hackers
On Fri, Apr 5, 2024 at 2:55 AM Jeff Davis <pgsql@j-davis.com> wrote:
>
> On Thu, 2024-04-04 at 17:28 +0900, Masahiko Sawada wrote:
> > > Perhaps it's not worth the effort though, if performance is already
> > > good enough?
> >
> > Yeah, it would be better to measure the overhead first. I'll do that.
>
> I have some further comments and I believe changes are required for
> v17.
>
> An indexed binary heap API requires both a comparator and a hash
> function to be specified, and has two different kinds of keys: the heap
> key (mutable) and the hash key (immutable). It provides heap methods
> and hashtable methods, and keeps the two internal structures (heap and
> HT) in sync.

IIUC, in ReorderBuffer for example, the heap key is the transaction
size and the hash key is the xid.
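
Just to make that concrete, here is a rough sketch (not the committed
code; txn_size_compare() and txn_xid_hash() are made-up names, and the
exact comparator signature is an assumption):

/* Illustration only: heap key vs. hash key for ReorderBufferTXN. */
static int
txn_size_compare(Datum a, Datum b, void *arg)
{
    ReorderBufferTXN *ta = (ReorderBufferTXN *) DatumGetPointer(a);
    ReorderBufferTXN *tb = (ReorderBufferTXN *) DatumGetPointer(b);

    /* heap key: transaction size, which changes as changes are added/evicted */
    if (ta->size < tb->size)
        return -1;
    if (ta->size > tb->size)
        return 1;
    return 0;
}

static uint32
txn_xid_hash(Datum d, void *arg)
{
    ReorderBufferTXN *txn = (ReorderBufferTXN *) DatumGetPointer(d);

    /* hash key: xid, which never changes for the lifetime of the entry */
    return murmurhash32((uint32) txn->xid);
}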

>
> The implementation in b840508644 uses the bh_node_type as the hash key,
> which is just a Datum, and it just hashes the bytes. I believe the
> implicit assumption is that the Datum is a pointer -- I'm not sure how
> one would use that API if the Datum were a value. Hashing a pointer
> seems strange to me and, while I see why you did it that way, I think
> it reflects that the API boundaries are not quite right.

I see your point. It assumes that the bh_node_type is a pointer, or at
least unique, so it cannot work when the Datum is a value.

>
> One consequence of using the pointer as the hash key is that you need
> to find the pointer first: you can't change or remove elements based on
> the transaction ID, you have to get the ReorderBufferTXN pointer by
> finding it in another structure, first. Currently, that's being done by
> searching ReorderBuffer->by_txn. So we actually have two hash tables
> for essentially the same purpose: one with xid as the key, and the
> other with the pointer as the key. That makes no sense -- let's have a
> proper indexed binary heap to look things up by xid (the internal HT)
> or by transaction size (using the internal heap).
>
> I suggest:
>
>   * Make a proper indexed binary heap API that accepts a hash function
> and provides both heap methods and HT methods that operate based on
> values (transaction size and transaction ID, respectively).
>   * Get rid of ReorderBuffer->by_txn and use the indexed binary heap
> instead.
>
> This will be a net simplification in reorderbuffer.c, which is good,
> because that file makes use of a *lot* of data structures.
>

It sounds like a data structure that combines a hash table and a
binary heap, and that we would use it as the main storage (e.g. for
ReorderBufferTXN) instead of using the binary heap as a secondary data
structure. IIUC, with your idea the indexed binary heap has a hash
table that stores the elements, each of which records its index within
the heap's node array. I guess it would be better to create it as a new
data structure rather than extending the existing binaryheap, since the
APIs could be very different. I might be missing something, though.
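
To sketch what I imagine the API surface of such a combined structure
could look like (purely hypothetical names; nothing like this exists in
the tree today):

/* Hypothetical API sketch for a combined binary heap + hash table. */
typedef struct indexed_binaryheap indexed_binaryheap;

typedef int (*ibh_compare_func) (Datum a, Datum b, void *arg);  /* heap key */
typedef uint32 (*ibh_hash_func) (Datum d, void *arg);           /* hash key */

extern indexed_binaryheap *ibh_allocate(int capacity,
                                        ibh_compare_func compare,
                                        ibh_hash_func hash,
                                        void *arg);

/* heap methods, ordered by the heap key (e.g. transaction size) */
extern void ibh_add(indexed_binaryheap *heap, Datum d);
extern Datum ibh_first(indexed_binaryheap *heap);
extern Datum ibh_remove_first(indexed_binaryheap *heap);

/* hash table methods, addressed by the hash key (e.g. xid) */
extern Datum ibh_find(indexed_binaryheap *heap, Datum d);
extern void ibh_remove(indexed_binaryheap *heap, Datum d);
extern void ibh_update(indexed_binaryheap *heap, Datum d); /* re-sift after
                                                            * heap key change */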

On Fri, Apr 5, 2024 at 3:55 AM Jeff Davis <pgsql@j-davis.com> wrote:
>
> On Thu, 2024-04-04 at 10:55 -0700, Jeff Davis wrote:
> >   * Make a proper indexed binary heap API that accepts a hash
> > function
> > and provides both heap methods and HT methods that operate based on
> > values (transaction size and transaction ID, respectively).
> >   * Get rid of ReorderBuffer->by_txn and use the indexed binary heap
> > instead.
>
> An alternative idea:
>
> * remove the hash table from binaryheap.c
>
> * supply a new callback to the binary heap with type like:
>
>   typedef void (*binaryheap_update_index)(
>     bh_node_type node,
>     int new_element_index);
>
> * make the remove, update_up, and update_down methods take the element
> index rather than the pointer
>
> reorderbuffer.c would then do something like:
>
>   void
>   txn_update_heap_index(ReorderBufferTXN *txn, int new_element_index)
>   {
>      txn->heap_element_index = new_element_index;
>   }
>
>   ...
>
>   txn_heap = binaryheap_allocate(..., txn_update_heap_index, ...);
>
> and then binaryheap.c would effectively maintain
> txn->heap_element_index, so reorderbuffer.c can pass that to the APIs
> that require the element index.

Thank you for the idea. I was thinking of the same approach when
considering your previous comment. With this idea, we still use the
binaryheap as a secondary data structure for ReorderBuffer. Since it
can be implemented with relatively small changes to the current
binaryheap, I implemented it and measured its performance.

I've attached a patch that adds an extension for benchmarking
binaryheap implementations; binaryheap_bench.c is the main test module.
To allow comparing different binaryheap implementations, the extension
bundles two additional ones, so binaryheap_bench.c uses three
binaryheap implementations in total, as the comment at the top of the
file says:

/*
 * This benchmark tool uses three binary heap implementations.
 *
 * "binaryheap" is the current binaryheap implementation in PostgreSQL. That
 * is, it internally has a hash table to track each node's index within the
 * node array.
 *
 * "xx_binaryheap" is based on "binaryheap" but removes the hash table.
 * Instead, each element carries its own index within the node array. The
 * element's index is updated by the callback function,
 * xx_binaryheap_update_index_fn, specified when calling
 * xx_binaryheap_allocate().
 *
 * "old_binaryheap" is the binaryheap implementation before the "indexed"
 * binary heap changes were made. It neither has an internal hash table nor
 * tracks nodes' indexes.
 */

That is, xx_binaryheap is the binaryheap implementation suggested above.
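
For reference, the caller side with xx_binaryheap looks roughly like
the sketch you gave above. The names xx_binaryheap_allocate() and the
update-index callback come from the comment above; everything else
below (ReorderBufferTXNSizeCompare, rb->txn_heap,
xx_binaryheap_update_down(), and the argument order) is approximated
for illustration:

/* Sketch of the reorderbuffer.c side; signatures are approximate. */
static void
txn_update_heap_index(bh_node_type node, int new_element_index)
{
    ReorderBufferTXN *txn = (ReorderBufferTXN *) DatumGetPointer(node);

    /* xx_binaryheap reports this element's new position in its node array */
    txn->heap_element_index = new_element_index;
}

    /* allocation: the callback replaces the internal hash table */
    rb->txn_heap = xx_binaryheap_allocate(capacity,
                                          ReorderBufferTXNSizeCompare,
                                          txn_update_heap_index,
                                          rb);

    /* update: no lookup needed, reorderbuffer.c already knows the index */
    xx_binaryheap_update_down(rb->txn_heap, txn->heap_element_index);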

The bench_load() function measures the time for adding elements (i.e.
using binaryheap_add() and similar). Here are the results:

postgres(1:3882886)=# select * from generate_series(1,3) x(x), lateral
(select * from bench_load(true, 10000000 * (1+x-x)));
 x |   cnt    | load_ms | xx_load_ms | old_load_ms
---+----------+---------+------------+-------------
 1 | 10000000 |    4372 |        582 |         429
 2 | 10000000 |    4371 |        582 |         429
 3 | 10000000 |    4373 |        582 |         429
(3 rows)

This shows that the current indexed binaryheap is much slower than the
other two implementations, and that xx_binaryheap performs well in
spite of also being indexed.

Here is another run with indexing disabled on the current binaryheap:

postgres(1:3882886)=# select * from generate_series(1,3) x(x), lateral
(select * from bench_load(false, 10000000 * (1+x-x)));
 x |   cnt    | load_ms | xx_load_ms | old_load_ms
---+----------+---------+------------+-------------
 1 | 10000000 |     697 |        579 |         430
 2 | 10000000 |     704 |        582 |         430
 3 | 10000000 |     698 |        581 |         429
(3 rows)

This shows that there is still a performance regression in the current
binaryheap even when indexing is disabled. xx_binaryheap also shows
some regression. I haven't investigated the root cause yet, though.

Overall, we can say there is considerable room to improve the current
binaryheap's performance, as you pointed out. As for implementing the
above idea (i.e. changing binaryheap into xx_binaryheap), it was
simple, since we just replace the code that updates the hash table with
code that calls the callback, provided we get consensus on the API
change.
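
To show the kind of change I mean inside the heap code (a rough sketch
with hypothetical helper and field names, not the actual diff):

/*
 * Wherever the indexed binaryheap used to record a node's new position
 * in its internal hash table, xx_binaryheap just reports it to the caller.
 */
static void
xx_bh_set_node(xx_binaryheap *heap, bh_node_type node, int index)
{
    heap->bh_nodes[index] = node;

    /* previously: update the node-to-index mapping in the internal hash */
    heap->bh_update_index(node, index);
}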

> Another alternative is to keep the hash table in binaryheap.c, and
> supply a hash function that hashes the xid. That leaves us with two
> hash tables still, but it would be cleaner than hashing the pointer.
> That might be best for right now, and we can consider these other ideas
> later.

The fact that we use simplehash for the internal hash table might make
this idea complex. If I understand your suggestion correctly, the
caller needs to supply the hash function when creating a binaryheap,
but with simplehash the hash function has to be specified at compile
time. We could use dynahash instead, but that would slow the binaryheap
down further.
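
To show why it is compile time, here is a simplified illustration of
the usual simplehash instantiation pattern (not the exact code in
binaryheap.c): the hash function is given as a macro before including
the template, so it cannot be swapped per heap at runtime.

/* Simplified illustration: simplehash is parameterized with macros that
 * are resolved at compile time, including the hash function. */
#define SH_PREFIX bh_nodeidx
#define SH_ELEMENT_TYPE bh_nodeidx_entry
#define SH_KEY_TYPE bh_node_type
#define SH_KEY key
#define SH_HASH_KEY(tb, key) \
    hash_bytes((const unsigned char *) &key, sizeof(bh_node_type))
#define SH_EQUAL(tb, a, b) (memcmp(&a, &b, sizeof(bh_node_type)) == 0)
#define SH_SCOPE static inline
#define SH_DEFINE
#define SH_DECLARE
#include "lib/simplehash.h"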

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

