Thread: Speeding up parts of the planner using a binary search tree structure for nodes

Hi,

In [1] I mentioned that I'd like to look into coding a data structure
to allow Node types to be looked up more efficiently than what List
allows.  There are many places in the planner where we must determine
if a given Node is in some List of Nodes. This is an O(n) operation.
We could reduce these to as low as O(log n) with a binary search tree.

A few places in the code that do this, and which can become a problem
when the lists being searched are large:

get_eclass_for_sort_expr()
tlist_member()

For the former, it's quite easy to see this come up in profiles by
querying a partitioned table with many partitions, e.g.:

create table listp (a int) partition by list(a);
select 'create table listp'||x::text||' partition of listp for values in('||x::text||');'
from generate_series(1,10000) x;
\gexec
create index on listp(a);
explain select * from listp order by a;

There are a few places that this new data structure could be used. My
primary interest at the moment is EquivalenceClass.ec_members. There
are perhaps other interesting places, such as planner targetlists, but
obviously, any changes would need to be proved worthwhile before
they're made.

Per what I mentioned in [1], I'd like this structure to be a binary
search tree (either a red-black or AVL tree compacted into an array).
So instead of having ->left and ->right pointers, we have left and
right offsets back into the array.  This allows very fast and
cache-friendly unordered traversals of the tree, as it's an array
whenever we want it to be and a tree when we need to perform a lookup
of a specific value. Because I'm not using a true tree, i.e. pointers
to elements, it does not appear that I can use rbtree.c.

The first step to make this work is to modify equalfuncs.c to add an
equalCompare() function, which returns an int to indicate the sort
order of one node against another. I've drafted a patch to implement
this and attached it here. I've done this in 2 phases: 0001 creates
and implements datatype-specific macros for the comparisons. Doing
this allows us to optimise the bool/char field comparison macros in
0002 so we can do a simple subtraction rather than 2 comparisons. The
0002 patch modifies each comparison macro and changes all the equal
functions to return int rather than bool.  The external comparison
function is equalCompare(). equal() just calls equalCompare() and
checks that the value returned is 0.
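
As a rough illustration of the kind of macros involved (the names here
are hypothetical, not the actual macros in equalfuncs.c): a generic
scalar comparison needs up to two comparisons per field, whereas
bool/char fields fit in an int and can be compared with one
subtraction:

```c
/* Hypothetical sketches of comparison macros; the real macro names in
 * equalfuncs.c may differ. */

/* Generic scalar field: up to two comparisons per field. */
#define COMPARE_SCALAR_FIELD(a, b) \
	do { \
		if ((a) < (b)) \
			return -1; \
		if ((a) > (b)) \
			return 1; \
	} while (0)

/* bool/char fields fit in an int, so a single subtraction gives the
 * sort order directly. */
#define COMPARE_CHAR_FIELD(a, b) \
	do { \
		int		cmp_ = (int) (a) - (int) (b); \
		if (cmp_ != 0) \
			return cmp_; \
	} while (0)

/* Example of an equal-style function rewritten to return a sort order
 * instead of a bool. */
static int
compare_example(char x, char y)
{
	COMPARE_CHAR_FIELD(x, y);
	return 0;
}
```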

With both patches applied, the size of the object file for equalfuncs.c
increases by about 17%:

equalfuncs.o
Master: 56768 bytes
Patched: 66912 bytes

If I don't use the macros specifically optimised for bool and char,
then the size is 68240 bytes. So I think doing 0001 is worth it.

This does break the API for ExtensibleNodeMethods as it's no longer
enough to just have nodeEqual(). In the 0002 patch I've changed this
to nodeCompare(). Extension authors who are using this will need to
alter their code to implement a proper comparison function.

For now, I'm mostly just interested in getting feedback about making
this change to equalfuncs.c.  Does anyone object to it?

David

[1] https://www.postgresql.org/message-id/CAKJS1f8v-fUG8YpaAGj309ZuALo3aEk7f6cqMHr_AVwz1fKXug%40mail.gmail.com

On Fri, May 29, 2020 at 10:47 AM David Rowley <dgrowleyml@gmail.com> wrote:
>
> Hi,
>
> In [1] I mentioned that I'd like to look into coding a data structure
> to allow Node types to be looked up more efficiently than what List
> allows.  There are many places in the planner where we must determine
> if a given Node is in some List of Nodes. This is an O(n) operation.
> We could reduce these to as low as O(log n) with a binary search tree.
>
> A few places in the code that does this which can become a problem
> when the lists being searched are large:
>
> get_eclass_for_sort_expr()

Eclass members have relids as their key. Can we use a hash table
instead, like how we manage join relations? I know that there are
other cases where we search for a subset of relids instead of an exact
match, so the hash table has to account for that. If we could somehow
create a hash value out of an expression node (we hash almost
anything, so that should be possible) we should be able to use a hash
table for searching expressions as well.
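
For what it's worth, hashing an expression node could follow the usual
pattern of combining per-field hashes. A rough sketch (all of these
names are made up for illustration, not PostgreSQL APIs):

```c
#include <stdint.h>

/* Hypothetical sketch of hashing an expression node by combining the
 * hashes of its fields; none of these names exist in PostgreSQL. */

static uint32_t
hash_combine32(uint32_t seed, uint32_t value)
{
	/* Boost-style hash combiner. */
	return seed ^ (value + 0x9e3779b9 + (seed << 6) + (seed >> 2));
}

/* A stand-in for a Var-like expression node. */
typedef struct ExampleVar
{
	int		type;			/* node tag, hashed first */
	int		varno;
	int		varattno;
} ExampleVar;

static uint32_t
example_var_hash(const ExampleVar *v)
{
	uint32_t	h = 0;

	h = hash_combine32(h, (uint32_t) v->type);
	h = hash_combine32(h, (uint32_t) v->varno);
	h = hash_combine32(h, (uint32_t) v->varattno);
	return h;
}
```

Maintaining such a function for every node type is the obvious cost of
this approach.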

> tlist_member()

A hash table with the expression as the key here as well?




-- 
Best Wishes,
Ashutosh Bapat



On Sat, 30 May 2020 at 01:52, Ashutosh Bapat
<ashutosh.bapat.oss@gmail.com> wrote:
>
> On Fri, May 29, 2020 at 10:47 AM David Rowley <dgrowleyml@gmail.com> wrote:
> > In [1] I mentioned that I'd like to look into coding a data structure
> > to allow Node types to be looked up more efficiently than what List
> > allows.  There are many places in the planner where we must determine
> > if a given Node is in some List of Nodes. This is an O(n) operation.
> > We could reduce these to as low as O(log n) with a binary search tree.
> >
> > A few places in the code that does this which can become a problem
> > when the lists being searched are large:
> >
> > get_eclass_for_sort_expr()
>
> eclass members have relids as their key. Can we use hash table instead
> like how we manage join relations? I know that there are other cases
> where we search for subset of relids instead of exact match. So the
> hash table has to account for that. If we could somehow create a hash
> value out of an expression node (we almost hash anything so that
> should be possible) we should be able to use hash table for searching
> expression as well.

This certainly could be done with hash tables. However, I feel it
raises the bar a little as it means we either need to maintain a hash
function for each node type or do some pre-processor magic in
equalfuncs.c to adjust the comparison macros into hashing macros to
allow it to be compiled again to implement hash functions.  If
everyone feels that's an okay thing to go and do then perhaps hash
tables are a better option.  I was just trying to keep the bar to some
level that I thought might be acceptable to the community.

It does seem likely to me that hash tables would be a more efficient
structure, but the gains might not be as big as you think. To perform
a lookup in the table we need to hash the entire node to get the hash
value, then perform at least one equal comparison.  With the Binary
Search Lists, we'd need more comparisons, but those can be partial
comparisons as they can abort early when we find the first mismatching
field in the Node. When the number of items in the list is fairly
small that might actually be less work, especially when the Node type
varies (since that's the first field we compare). Hash tables are
likely to become more efficient when the number of items is larger.
The general case is that we'll just have a small number of items. I'd
like to improve the less common situation when the number of items is
large with minimal impact for when the number of items is small.
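
To make the early-abort point concrete, here is a hedged sketch with a
made-up node layout (real planner nodes are far more complex): a
comparison can stop at the first mismatching field, while a hash
function must always visit every field:

```c
/* Hypothetical three-field node, purely for illustration. */
typedef struct SketchNode
{
	int		type;			/* compared first, so differing types abort early */
	int		field1;
	int		field2;
} SketchNode;

/* Returns <0, 0, or >0; stops at the first mismatching field. */
static int
sketch_node_compare(const SketchNode *a, const SketchNode *b)
{
	if (a->type != b->type)
		return (a->type < b->type) ? -1 : 1;
	if (a->field1 != b->field1)
		return (a->field1 < b->field1) ? -1 : 1;
	if (a->field2 != b->field2)
		return (a->field2 < b->field2) ? -1 : 1;
	return 0;
}
```

When node types differ, this returns after a single comparison,
whereas a hash lookup would have hashed all three fields first.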

> > tlist_member()
>
> hash table by expression as key here as well?

The order of the tlist does matter in many cases. I'm unsure how we
could track the order that items were added to the hash table and then
obtain the items back in that order in an efficient way. I imagine
we'd need to store the items in some secondary data structure, such as
a List. There's obviously going to be some overhead to doing that.
The idea with the Binary Search Lists was that items would be stored
in an array, similar to List, but the cells of the list would contain
the offset index to its parent and left and right child.  New items
would always take the next free slot in the array, just like List
does, so we'd easily be able to get the insert order by looping over
the array of elements in order.

David

> > [1] https://www.postgresql.org/message-id/CAKJS1f8v-fUG8YpaAGj309ZuALo3aEk7f6cqMHr_AVwz1fKXug%40mail.gmail.com



Hi David,

On Fri, May 29, 2020 at 2:16 PM David Rowley <dgrowleyml@gmail.com> wrote:
> In [1] I mentioned that I'd like to look into coding a data structure
> to allow Node types to be looked up more efficiently than what List
> allows.  There are many places in the planner where we must determine
> if a given Node is in some List of Nodes. This is an O(n) operation.
> We could reduce these to as low as O(log n) with a binary search tree.
>
> A few places in the code that does this which can become a problem
> when the lists being searched are large:
>
> get_eclass_for_sort_expr()
> tlist_member()
>
> For the former, it's quite easy to see this come up in the profiles by
> querying a partitioned table with many partitions. e.g
>
> create table listp (a int) partition by list(a);
> select 'create table listp'||x::Text || ' partition of listp for
> values in('||x::text||');' from generate_Series(1,10000)x;
> \gexec
> create index on listp(a);
> explain select * from listp order by a;
>
> There are a few places that this new data structure could be used. My
> primary interest at the moment is EquivalenceClass.ec_members. There
> are perhaps other interesting places, such as planner targetlists, but
> obviously, any changes would need to be proved worthwhile before
> they're made.

I'm glad to see you mention this problem.

I have often wondered if we could do something about these data
structures growing bigger with the number of partitions in the first
place.  I could imagine that when "child" EC members were designed,
Tom maybe didn't think we'd continue relying on the abstraction for
the 10000 partitions case.  Making the lookup algorithm smarter with
binary or hashed search, as you are trying to do, might help us
somewhat, but would it be interesting to consider alternative designs
for the underlying abstraction?  Sorry, I don't have any concrete
ideas to offer at the moment, but I thought sharing this perspective of
the problem might inspire others who may have some.

-- 
Amit Langote
EnterpriseDB: http://www.enterprisedb.com





On Tue, 2 Jun 2020 at 03:13, David Rowley <dgrowleyml@gmail.com> wrote:


> It does seem likely to me that hash tables would be a more efficient
> structure, but the gains might not be as big as you think. To perform
> a lookup in the table we need to hash the entire node to get the hash
> value, then perform at least one equal comparison.  With the Binary
> Search Lists, we'd need more comparisons, but those can be partial
> comparisons as they can abort early when we find the first mismatching
> field in the Node. When the number of items in the list is fairly
> small that might actually be less work, especially when the Node type
> varies (since that's the first field we compare). Hash tables are
> likely to become more efficient when the number of items is larger.
> The general case is that we'll just have a small number of items. I'd
> like to improve the less common situation when the number of items is
> large with minimal impact for when the number of items is small.

Agreed. I think we can again borrow from the way we search for a join:
when the number of joins is small we use a List, and for a larger
number we use a hash table. I am not suggesting that we use a List in
this case, but the idea is to use two data structures; maybe every
eclass would use one of them depending upon the number of members.
Queries involving partitioned tables with lots of partitions will have
a large number of child eclass members.

> > tlist_member()
>
> hash table by expression as key here as well?

> The order of the tlist does matter in many cases. I'm unsure how we
> could track the order that items were added to the hash table and then
> obtain the items back in that order in an efficient way. I imagine
> we'd need to store the items in some secondary data structure, such as
> a List.

If we use a hash table then we'd need to retain the List as well. But
your idea below looks better.
> There's obviously going to be some overhead to doing that.
> The idea with the Binary Search Lists was that items would be stored
> in an array, similar to List, but the cells of the list would contain
> the offset index to its parent and left and right child.  New items
> would always take the next free slot in the list, just like List does
> so we'd easily be able to get the insert order by looping over the
> array of elements in order.

That seems like a good idea. I am a bit worried, though, that expression nodes do not have any inherent ordering, and we are proposing to use a structure which relies on an ordering.

--
Best Wishes,
Ashutosh