Thread: Reducing memory consumed by RestrictInfo list translations in partitionwise join planning

Hi All,

Following up on [1] ...

A restrictlist is a list of RestrictInfo nodes, each representing one
clause, applicable to a set of joins.  When partitionwise join is used
as a strategy, the restrictlists for child join are obtained by
translating the restrictlists for the parent in
try_partitionwise_join(). That function is called once for every join
order. E.g. when computing join ABC, it will be called for planning
joins AB, BC, AC, (AB)C, A(BC), B(AC). Every time it is called it will
translate the given parent restrictlist. This means that a
RestrictInfo node applicable to a given set of child relations will be
translated as many times as there are join orders in which those child
relations appear in different joining relations.

For example, consider a query "select * from A, B, C where A.a = B.a
and B.a = C.a" where A, B and C are partitioned tables. A has
partitions A1, A2, ... An. B has partitions B1, B2, ... Bn and C has
partitions C1, C2, ... Cn. Partitions Ai, Bi and Ci are matching
partitions respectively for all i's. The join ABC is computed as
Append of (A1B1C1, A2B2C2, ... AnBnCn). The clause A.a = B.a is
translated to A1.a = B1.a thrice, when computing A1B1, A1(B1C1) and
B1(A1C1) respectively. Similarly other clauses are translated multiple
times. Some extra translations also happen in
reparameterize_path_by_child().

These translations consume memory which remains allocated till the
statement finishes. A RestrictInfo should be translated only once per
parent-child pair, thus avoiding the extra memory consumption.

There are two patches attached
0001 - to measure memory consumption during planning. This is the same
one as attached to [1].
0002 - WIP patch to avoid repeated translations of RestrictInfo.
The WIP patch avoids repeated translations by tracking the child for
which a RestrictInfo is translated and reusing the same translation
every time it is requested. In order to track the translations,
RestrictInfo gets two new members.
1. parent_rinfo - In a child's RestrictInfo this points to the
RestrictInfo applicable to the topmost parent in partition hierarchy.
This is NULL in the topmost parent's RestrictInfo
2. child_rinfos - In a parent's RestrictInfo, this is a list that
contains all the translated child RestrictInfos. In child
RestrictInfos this is NULL.

Every translated RestrictInfo is stored in the top parent's
RestrictInfo's child_rinfos. RestrictInfo::required_relids is used as the
key to search for a given translation. I have intercepted
adjust_appendrel_attrs_mutator() to track translations as well as
avoid multiple translations. It first looks for an existing
translation when translating a RestrictInfo and creates a new one only
when one doesn't exist already.
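
To make that lookup-or-translate step concrete, here is a rough C sketch
(not the patch itself): the helper name get_child_rinfo() is invented, and
it assumes the two new RestrictInfo fields described above plus the
existing helpers from optimizer/appendinfo.h.

#include "postgres.h"
#include "nodes/pathnodes.h"
#include "optimizer/appendinfo.h"

/*
 * Sketch only: return the translation of 'rinfo' for the child described
 * by 'appinfos', translating at most once per parent-child pair.
 */
static RestrictInfo *
get_child_rinfo(PlannerInfo *root, RestrictInfo *rinfo,
                int nappinfos, AppendRelInfo **appinfos)
{
    /* All translations hang off the topmost parent's RestrictInfo. */
    RestrictInfo *top = rinfo->parent_rinfo ? rinfo->parent_rinfo : rinfo;
    Relids        child_relids;
    ListCell     *lc;
    RestrictInfo *child;

    /* required_relids of the would-be translation acts as the search key. */
    child_relids = adjust_child_relids(rinfo->required_relids,
                                       nappinfos, appinfos);

    foreach(lc, top->child_rinfos)
    {
        child = lfirst_node(RestrictInfo, lc);
        if (bms_equal(child->required_relids, child_relids))
            return child;       /* reuse the existing translation */
    }

    /* No translation yet: translate once and remember it. */
    child = (RestrictInfo *) adjust_appendrel_attrs(root, (Node *) rinfo,
                                                    nappinfos, appinfos);
    child->parent_rinfo = top;
    top->child_rinfos = lappend(top->child_rinfos, child);

    return child;
}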

Using this patch the memory consumption for the above query reduces as follows

Number of partitions: 1000

Number of tables | without patch  | with patch | % reduction |
being joined     |                |            |             |
--------------------------------------------------------------
               2 |      40.3 MiB  |   37.7 MiB |       6.43% |
               3 |     146.8 MiB  |  133.0 MiB |       9.42% |
               4 |     445.4 MiB  |  389.5 MiB |      12.57% |
               5 |    1563.2 MiB  | 1243.2 MiB |      20.47% |

The number of times a RestrictInfo needs to be translated increases
exponentially with the number of tables joined. Thus we see more
memory saved as the number of tables joined increases. When two tables
are joined there's only a single join planned so no extra translations
happen in try_partitionwise_join(). The memory saved in case of 2
joining tables comes from avoiding extra translations happening during
reparameterization of paths (in reparameterize_path_by_child()).

The attached patch is to show how much memory can be saved if we avoid
extra translation. But I want to discuss the following things about
the approach.

1. The patch uses RestrictInfo::required_relids as the key for
searching child RestrictInfos. I am not sure which of the two viz.
required_relids and clause_relids is a better key. required_relids
seems to be a subset of clause_relids and from the description it
looks like that's the set that decides the applicability of a clause
in a join. But clause_relids is obtained from all the Vars that appear
in the clause, so maybe that's the one that matters for the
translations. Can somebody guide me?

2. The patch adds two extra pointers per RestrictInfo. They will
remain unused when partitionwise join is not used. Right now, I do not
see any increase in memory consumed by planner because of those
pointers even in case of unpartitioned tables; maybe they are absorbed
in memory alignment. They may show up as extra memory in the future. I
am wondering whether we can instead save and track translations in
PlannerInfo as a hash table using <rinfo_serial, required_relids (or
whatever is the answer to above question) of parent and child
respectively> as key. That will just add one extra pointer in
PlannerInfo when partitionwise join is not used. Please let me know
your suggestions.

3. I have changed adjust_appendrel_attrs_mutator() to return a
translated RestrictInfo if it already exists. IOW, it won't always
return a deep copy of given RestrictInfo as it does today. This can be
fixed by writing wrappers around adjust_appendrel_attrs() to translate
RestrictInfo specifically. But maybe we don't always need deep copies.
Are there any cases when we need translated deep copies of
RestrictInfo? Those cases will require fixing callers of
adjust_appendrel_attrs() instead of the mutator.

4. IIRC, when partitionwise join was implemented we had discussed
creating child RestrictInfos using logic similar to
build_joinrel_restrictlist(). That might be another way to build a
RestrictInfo only once and use it multiple times. But we felt that it
was a much harder problem to solve since it's not known which partitions
from the joining partitioned tables will match and be joined till we
enter try_partitionwise_join(), so the required RestrictInfos may not
be available in RelOptInfo::joininfo. Let me know your thoughts on
this.

Comments/suggestions welcome.

references
[1] https://www.postgresql.org/message-id/CAExHW5stmOUobE55pMt83r8UxvfCph+Pvo5dNpdrVCsBgXEzDQ@mail.gmail.com

--
Best Wishes,
Ashutosh Bapat

Attachment

On Thu, Jul 27, 2023 at 10:06 PM Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> wrote:
0002 - WIP patch to avoid repeated translations of RestrictInfo.
The WIP patch avoids repeated translations by tracking the child for
which a RestrictInfo is translated and reusing the same translation
every time it is requested. In order to track the translations,
RestrictInfo gets two new members.
1. parent_rinfo - In a child's RestrictInfo this points to the
RestrictInfo applicable to the topmost parent in partition hierarchy.
This is NULL in the topmost parent's RestrictInfo
2. child_rinfos - In a parent's RestrictInfo, this is a list that
contains all the translated child RestrictInfos. In child
RestrictInfos this is NULL.

I haven't looked into the details but with 0002 patch I came across a
crash while planning the query below.

regression=# set enable_partitionwise_join to on;
SET
regression=# EXPLAIN (COSTS OFF)
SELECT * FROM prt1 t1, prt2 t2
WHERE t1.a = t2.b AND t1.a < 450 AND t2.b > 250 AND t1.b = 0;
server closed the connection unexpectedly

Thanks
Richard
On Tue, Aug 8, 2023 at 4:08 PM Richard Guo <guofenglinux@gmail.com> wrote:
>
>
> I haven't looked into the details but with 0002 patch I came across a
> crash while planning the query below.
>
> regression=# set enable_partitionwise_join to on;
> SET
> regression=# EXPLAIN (COSTS OFF)
> SELECT * FROM prt1 t1, prt2 t2
> WHERE t1.a = t2.b AND t1.a < 450 AND t2.b > 250 AND t1.b = 0;
> server closed the connection unexpectedly

Thanks for the report. PFA the patches with this issue fixed. There is
another issue seen when running partition_join.sql "ERROR:  index key
does not match expected index column". I will investigate that issue.

Right now I am looking at the 4th point in my first email in this
thread [1]. I am trying to figure out whether that approach would work.
If that approach doesn't work, what's the best way to track the
translations and also figure out answers to points 1, 2 and 3 there. Please let
me know your opinions if any.

--
Best Wishes,
Ashutosh Bapat

[1] https://www.postgresql.org/message-id/CAExHW5s=bCLMMq8n_bN6iU+Pjau0DS3z_6Dn6iLE69ESmsPMJQ@mail.gmail.com

Attachment
On Fri, 28 Jul 2023 at 02:06, Ashutosh Bapat
<ashutosh.bapat.oss@gmail.com> wrote:
> 0001 - to measure memory consumption during planning. This is the same
> one as attached to [1].

I see you're recording the difference in the CurrentMemoryContext of
palloc'd memory before and after planning.  That won't really alert us
to problems if the planner briefly allocates, say, 12GB of memory
during, say, the join search and then quickly pfree's it again.  Unless
it's an oversized chunk, aset.c won't free() any memory until
MemoryContextReset(). Chunks just go onto a freelist for reuse later.
So at the end of planning, the context may still have that 12GBs
malloc'd, yet your new EXPLAIN ANALYZE property might end up just
reporting a tiny fraction of that.

I wonder if it would be more useful to just have ExplainOneQuery()
create a new memory context and switch to that and just report the
context's mem_allocated at the end.
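
For illustration, a minimal sketch of that idea (a fragment that would
live in something like ExplainOneQuery(); variable names are illustrative,
the APIs are existing ones): run planning under a dedicated context and
report how much that context has malloc'd, which also counts chunks
sitting on aset.c freelists.

    MemoryContext planner_ctx;
    MemoryContext oldctx;
    PlannedStmt  *plan;
    Size          planner_mem;

    planner_ctx = AllocSetContextCreate(CurrentMemoryContext,
                                        "planner context for EXPLAIN",
                                        ALLOCSET_DEFAULT_SIZES);
    oldctx = MemoryContextSwitchTo(planner_ctx);
    plan = pg_plan_query(query, queryString, cursorOptions, params);
    MemoryContextSwitchTo(oldctx);

    /* includes memory kept on freelists, unlike counting palloc'd bytes */
    planner_mem = MemoryContextMemAllocated(planner_ctx, true);

    /* ... then hand 'plan' to ExplainOnePlan() as usual ... */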

It's also slightly annoying that these planner-related summary outputs
are linked to EXPLAIN ANALYZE. We could be showing them in EXPLAIN
without ANALYZE.  If we were to change that now, it might be a bit
annoying for the regression tests as we'd need to go and add SUMMARY
OFF in a load of places...

drowley@amd3990x:~/pg_src/src/test/regress/sql$ git grep -i "costs off" | wc -l
1592

hmm, that would cause a bit of churn... :-(

David



On Wed, Aug 9, 2023 at 10:12 AM David Rowley <dgrowleyml@gmail.com> wrote:
>
> On Fri, 28 Jul 2023 at 02:06, Ashutosh Bapat
> <ashutosh.bapat.oss@gmail.com> wrote:
> > 0001 - to measure memory consumption during planning. This is the same
> > one as attached to [1].
>

I have started a separate thread to discuss this patch. I am taking
this discussion to that thread.

--
Best Wishes,
Ashutosh Bapat



I spent some time on the 4th point below but also looked at the other
points. Here's what I have found so far.

On Thu, Jul 27, 2023 at 7:35 PM Ashutosh Bapat
<ashutosh.bapat.oss@gmail.com> wrote:
>
> 1. The patch uses RestrictInfo::required_relids as the key for
> searching child RelOptInfos. I am not sure which of the two viz.
> required_relids and clause_relids is a better key. required_relids
> seems to be a subset of clause_relids and from the description it
> looks like that's the set that decides the applicability of a clause
> in a join. But clause_relids is obtained from all the Vars that appear
> in the clause, so may be that's the one that matters for the
> translations. Can somebody guide me?

I was wrong that required_relids is a subset of clause_relids. The first
can contain OJ relids which the other cannot. OJ relids do not have
any children, so they won't be translated. So clause_relids seems to
be a better key. I haven't made a decision yet.

>
> 2. The patch adds two extra pointers per RestrictInfo. They will
> remain unused when partitionwise join is not used. Right now, I do not
> see any increase in memory consumed by planner because of those
> pointers even in case of unpartitioned tables; maybe they are absorbed
> in memory alignment. They may show up as extra memory in the future. I
> am wondering whether we can instead save and track translations in
> PlannerInfo as a hash table using <rinfo_serial, required_relids (or
> whatever is the answer to above question) of parent and child
> respectively> as key. That will just add one extra pointer in
> PlannerInfo when partitionwise join is not used. Please let me know
> your suggestions.

I will go ahead with a pointer in PlannerInfo for now.

>
> 3. I have changed adjust_appendrel_attrs_mutator() to return a
> translated RestrictInfo if it already exists. IOW, it won't always
> return a deep copy of given RestrictInfo as it does today. This can be
> fixed by writing wrappers around adjust_appendrel_attrs() to translate
> RestrictInfo specifically. But maybe we don't always need deep copies.
> Are there any cases when we need translated deep copies of
> RestrictInfo? Those cases will require fixing callers of
> adjust_appendrel_attrs() instead of the mutator.

I think it's better to handle the tracking logic outside
adjust_appendrel_attrs. That will be some code churn but it will be
cleaner and won't affect anything other than partitionwise joins.

>
> 4. IIRC, when partitionwise join was implemented we had discussed
> creating child RestrictInfos using a login similar to
> build_joinrel_restrictlist(). That might be another way to build
> RestrictInfo only once and use it multiple times. But we felt that it
> was much harder problem to solve since it's not known which partitions
> from joining partitioned tables will match and will be joined till we
> enter try_partitionwise_join(), so the required RestrictInfos may not
> be available in RelOptInfo::joininfo. Let me know your thoughts on
> this.

Here's a somewhat lengthy description of why I feel translations are better
than computing the restrictlist and joininfo for a child join from the
joining relations' joininfo.
Consider a query "select * from p, q, r where p.c1 = q.c1 and q.c1 =
r.c1 and p.c2 + q.c2 < r.c2 and p.c3 != q.c3 and q.c3 != r.c3". The
query has following clauses
1. p.c1 = q.c1
2. q.c1 = r.c1
3. p.c2 + q.c2 < r.c2
4. p.c3 != q.c3
5. q.c3 != r.c3

The first two clauses are added to EC machinery and do not appear in
joininfo. They appear in restrictlist when we construct clauses in
restrictlist from ECs. Let's ignore them for now.

Assume that each of the tables p, q, r has partitions (p1, p2, ...), (q1, q2,
...) and (r1, r2, ...) respectively. Each triplet (pi, qi, ri) forms
the set of matching partitions from p, q, r respectively for all "i".
Consider the join p1q1r1. We will generate relations p1, q1, r1, p1q1,
p1r1, q1r1 and p1q1r1 while building the last join. Below is a
description of how these clauses would look in each of these relations
and the list they appear in when computing that join. Please notice
the numeric suffixes carefully.

p1.
joininfo: p1.c2 + q.c2 < r.c2, p1.c3 != q.c3
restrictlist: <>

q1
joininfo: p.c2 + q1.c2 < r.c2, p.c3 != q1.c3, q1.c3 != r.c3
restrictlist: <>

r1
joininfo: p.c2 + q.c2 < r1.c2, q.c3 != r1.c3
restrictlist: <>

p1q1
joininfo: p1.c2 + q1.c2 < r.c2, q1.c3 != r.c3
restrictlist: p1.c3 != q1.c3

q1r1
joininfo: p.c2 + q1.c2 < r1.c2, p.c3 != q1.c3
restrictlist: q1.c3 != r1.c3

p1r1
joininfo: p1.c2 + q.c2 < r1.c2, p1.c3 != q.c3, q.c3 != r1.c3
restrictlist: <>

p1q1r1
joininfo: <>
restrictlist for (p1q1)r1: p1.c2 + q1.c2 < r1.c2, q1.c3 != r1.c3
restrictlist for (p1r1)q1: p1.c2 + q1.c2 < r1.c2, p1.c3 != q1.c3, q1.c3 != r1.c3
restrictlist for p1(q1r1): p1.c2 + q1.c2 < r1.c2, p1.c3 != q1.c3

If we translate the clauses when building a join, e.g. translate p1.c3 !=
q1.c3 when building p1q1 or p1q1r1, it would cause repeated
translations. So the translations need to be saved in lower relations
when we detect matching partitions, and then reused from there. That is
something I have done in the attached patches. But the problem is that the
same clause reaches its final translation through different
intermediate translations as the join search advances. E.g. the
evolution of p.c2 + q.c2 < r.c2 to p1.c2 + q1.c2 < r1.c2 has three
different intermediate translations at the second level of join. These
intermediate translations conflict with each other, and none of
them can be saved in any of the second-level joins as a candidate for
the last-stage translation. Extending the logic in the patches to handle
this would make them more complicated.

Another possibility is to avoid the same clause being translated
multiple times when building the join using
RestrictInfo::rinfo_serial. But that by itself won't help avoid
repeated translations caused by different join orders. E.g. we won't
be able to detect that p.c2 + q.c2 < r.c2 has already been translated to
p1.c2 + q1.c2 < r1.c2 when we computed (p1r1)q1, p1(q1r1) or
(p1q1)r1, whichever was computed earlier. For that we need some
tracking outside the join relations themselves, like I did in my first
patch.

Coming back to the problem of generating child restrictlist clauses
from equivalence classes, I think it's easier with some tweaks to pass
child relids down to the minions when dealing with child joins. It
seems to be working as is but I haven't tested it thoroughly.

Obtaining child clauses from parent clauses by translation and
tracking the translations is less complex and may be more efficient
too. I will post a patch on those lines soon.

--
Best Wishes,
Ashutosh Bapat

Attachment
Hi All,

On Fri, Aug 11, 2023 at 6:24 PM Ashutosh Bapat
<ashutosh.bapat.oss@gmail.com> wrote:
>
> Obtaining child clauses from parent clauses by translation and
> tracking the translations is less complex and may be more efficient
> too. I will post a patch on those lines soon.
>

PFA patch set to add infrastructure to track RestrictInfo translations
and reuse them.

PlannerInfo gets a new member "struct HTAB *child_rinfo_hash", which is
a hash table of hash tables keyed by RestrictInfo::rinfo_serial. Each
element in the outer hash table is itself a hash table of RestrictInfos
keyed by RestrictInfo::required_relids, as explained in my previous email.
When building child clauses, either a. when building child join rels or
b. when reparameterizing paths, we first access the first-level hash table
using the parent's RestrictInfo::rinfo_serial, and then search for the
required translation using the child RestrictInfo::required_relids
obtained by translating the parent RestrictInfo::required_relids. If the
translation doesn't exist, we create one and add it to the hash table.
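
A sketch of how the first level of that lookup could look (the struct and
function names here are made up for illustration, not taken from the patch;
it assumes the new PlannerInfo::child_rinfo_hash member and the existing
RestrictInfo::rinfo_serial field):

#include "postgres.h"
#include "nodes/pathnodes.h"
#include "utils/hsearch.h"

/*
 * Outer hash table entry: one per parent clause, keyed by rinfo_serial.
 * The outer table itself would be created with keysize = sizeof(int) and
 * HASH_ELEM | HASH_BLOBS.
 */
typedef struct RinfoSerialEntry
{
    int   rinfo_serial;     /* hash key */
    HTAB *translations;     /* inner table keyed by child required_relids */
} RinfoSerialEntry;

/*
 * Sketch: return the table of known translations of 'parent_rinfo', or
 * NULL if none have been recorded yet.  The caller then probes it with
 * the translated required_relids and inserts a new translation on miss.
 */
static HTAB *
find_translation_table(PlannerInfo *root, RestrictInfo *parent_rinfo)
{
    RinfoSerialEntry *entry;
    bool              found;

    if (root->child_rinfo_hash == NULL)
        return NULL;

    entry = (RinfoSerialEntry *) hash_search(root->child_rinfo_hash,
                                             &parent_rinfo->rinfo_serial,
                                             HASH_FIND, &found);
    return found ? entry->translations : NULL;
}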

RestrictInfo::required_relids is the same for a RestrictInfo and its
commuted RestrictInfo. The order of operands is important for
IndexClauses. Hence we track the commuted RestrictInfo in a new field
RestrictInfo::comm_rinfo. RestrictInfo::is_commuted differentiates
between a RestrictInfo and its commuted version. This is explained as
a comment in the patch. This scheme has a minor benefit of saving
memory when the same RestrictInfo is commuted multiple times.
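
As a sketch of that reuse (assuming the comm_rinfo/is_commuted fields
described above and the existing commute_restrictinfo() helper; the wrapper
name is invented):

#include "postgres.h"
#include "nodes/pathnodes.h"
#include "optimizer/restrictinfo.h"

/* Sketch only: build the commuted version of 'rinfo' at most once. */
static RestrictInfo *
get_commuted_rinfo(RestrictInfo *rinfo, Oid comm_op)
{
    if (rinfo->comm_rinfo == NULL)
    {
        RestrictInfo *commuted = commute_restrictinfo(rinfo, comm_op);

        commuted->is_commuted = true;
        commuted->comm_rinfo = rinfo;   /* link the two versions */
        rinfo->comm_rinfo = commuted;
    }

    return rinfo->comm_rinfo;
}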

A hash table of hash tables is used instead of an array of hash tables
since a. not every rinfo_serial has a RestrictInfo associated with it,
b. not every RestrictInfo has translations, and c. the exact size of
such an array is not known till planning ends since we continue to
create new clauses as we create new RelOptInfos. Of course, an array
can be repalloc'ed and unused slots in the array may not waste a lot
of memory. I am open to changing the hash table to an array if that
turns out to be more efficient.

With this set of patches, the memory consumption stands as below
Number of tables | without patch  | with patch | % reduction |
being joined     |                |            |             |
--------------------------------------------------------------
               2 |      40.8 MiB  |   37.4 MiB |       8.46% |
               3 |     151.6 MiB  |  135.0 MiB |      10.96% |
               4 |     464.0 MiB  |  403.6 MiB |      12.00% |
               5 |    1663.9 MiB  | 1329.1 MiB |      20.12% |

The patch set is thus
0001 - patch used to measure memory used during planning

0002 - Patch to free intermediate Relids computed by
adjust_child_relids_multilevel(). I didn't test memory consumption for
multi-level partitioning. But this is a clear improvement. In that
function we free the AppendInfos array, which is as many pointers long as
the number of relids. So it doesn't make sense not to free the Relids,
which can be at least {largest relid}/8 bytes long.

0003 - patch to save and reuse commuted RestrictInfo. This patch by
itself shows a small memory saving (3%) in the query below where the
same clause is commuted twice. The query does not contain any
partitioned tables.
create table t2 (a int primary key, b int, c int);
create index t2_a_b on t2(a, b);
select * from t2 where 10 = a
Memory used without patch: 13696 bytes
Memory used with patch: 13264 bytes

0004 - Patch which implements the hash table of hash tables described
above and also code to avoid repeated RestrictInfo list translations.

I will add this patchset to next commitfest.

--
Best Wishes,
Ashutosh Bapat

Attachment
On Thu, Sep 14, 2023 at 4:39 PM Ashutosh Bapat
<ashutosh.bapat.oss@gmail.com> wrote:
>
> The patch set is thus
> 0001 - patch used to measure memory used during planning
>
> 0002 - Patch to free intermediate Relids computed by
> adjust_child_relids_multilevel(). I didn't test memory consumption for
> multi-level partitioning. But this is clear improvement. In that
> function we free the AppendInfos array which as many pointers long as
> the number of relids. So it doesn't make sense not to free the Relids
> which can be {largest relid}/8 bytes long at least.
>
> 0003 - patch to save and reuse commuted RestrictInfo. This patch by
> itself shows a small memory saving (3%) in the query below where the
> same clause is commuted twice. The query does not contain any
> partitioned tables.
> create table t2 (a int primary key, b int, c int);
> create index t2_a_b on t2(a, b);
> select * from t2 where 10 = a
> Memory used without patch: 13696 bytes
> Memory used with patch: 13264 bytes
>
> 0004 - Patch which implements the hash table of hash table described
> above and also code to avoid repeated RestrictInfo list translations.
>
> I will add this patchset to next commitfest.
>
> --
> Best Wishes,
> Ashutosh Bapat

PFA rebased patches. Nothing changes in 0002, 0003 and 0004. 0001 is
the squashed version of the latest patch set at
https://www.postgresql.org/message-id/CAExHW5sCJX7696sF-OnugAiaXS=Ag95=-m1cSrjcmyYj8Pduuw@mail.gmail.com.

--
Best Wishes,
Ashutosh Bapat

Attachment

Hi! Thank you for your work on the subject, I think it's a really useful feature. I've reviewed your patch and I have a few questions.

First of all, have you thought about creating a gun parameter to display memory scheduling information? I agree that this is an important feature, but I think only for debugging.

Secondly, I noticed that for the child_rinfo_hash key you use a counter (int) and can it lead to collisions? Why didn't you generate a hash from childRestrictInfo for this? For example, something like how it is formed here [0].

[0] https://www.postgresql.org/message-id/43ad8a48-b980-410d-a83c-5beebf82a4ed%40postgrespro.ru

-- 
Regards,
Alena Rybakina
Postgres Professional
On 24.11.2023 13:20, Alena Rybakina wrote:

Hi! Thank you for your work on the subject, I think it's a really useful feature. I've reviewed your patch and I have a few questions.

First of all, have you thought about creating a gun parameter to display memory scheduling information? I agree that this is an important feature, but I think only for debugging.

Secondly, I noticed that for the child_rinfo_hash key you use a counter (int) and can it lead to collisions? Why didn't you generate a hash from childRestrictInfo for this? For example, something like how it is formed here [0].

[0] https://www.postgresql.org/message-id/43ad8a48-b980-410d-a83c-5beebf82a4ed%40postgrespro.ru

Sorry, my first question was not clear, I mean: have you thought about creating a guc parameter to display memory planning information?
-- 
Regards,
Alena Rybakina
On Fri, Nov 24, 2023 at 3:56 PM Alena Rybakina <lena.ribackina@yandex.ru> wrote:
>
> On 24.11.2023 13:20, Alena Rybakina wrote:
>
> Hi! Thank you for your work on the subject, I think it's a really useful feature.
>
> I've reviewed your patch and I have a few questions.
>
> First of all, have you thought about creating a gun parameter to display memory scheduling information? I agree that
> this is an important feature, but I think only for debugging.

Not a GUC parameter but I have a proposal to use EXPLAIN for the same. [1]

>
> Secondly, I noticed that for the child_rinfo_hash key you use a counter (int) and can it lead to collisions? Why
> didn't you generate a hash from childRestrictInfo for this? For example, something like how it is formed here [0].

Not usually. But that's the only "key" we have to access a set of
semantically same RestrictInfos. Relids is another key to access the
exact RestrictInfo. A child RestrictInfo cannot be used since there
will be many child RestrictInfos. Similarly, the parent RestrictInfo cannot
be used since there will be multiple forms of the same RestrictInfo.

[1] https://commitfest.postgresql.org/45/4492/

--
Best Wishes,
Ashutosh Bapat



On Tue, 31 Oct 2023 at 10:48, Ashutosh Bapat
<ashutosh.bapat.oss@gmail.com> wrote:
>
> On Thu, Sep 14, 2023 at 4:39 PM Ashutosh Bapat
> <ashutosh.bapat.oss@gmail.com> wrote:
> >
> > The patch set is thus
> > 0001 - patch used to measure memory used during planning
> >
> > 0002 - Patch to free intermediate Relids computed by
> > adjust_child_relids_multilevel(). I didn't test memory consumption for
> > multi-level partitioning. But this is clear improvement. In that
> > function we free the AppendInfos array which as many pointers long as
> > the number of relids. So it doesn't make sense not to free the Relids
> > which can be {largest relid}/8 bytes long at least.
> >
> > 0003 - patch to save and reuse commuted RestrictInfo. This patch by
> > itself shows a small memory saving (3%) in the query below where the
> > same clause is commuted twice. The query does not contain any
> > partitioned tables.
> > create table t2 (a int primary key, b int, c int);
> > create index t2_a_b on t2(a, b);
> > select * from t2 where 10 = a
> > Memory used without patch: 13696 bytes
> > Memory used with patch: 13264 bytes
> >
> > 0004 - Patch which implements the hash table of hash table described
> > above and also code to avoid repeated RestrictInfo list translations.
> >
> > I will add this patchset to next commitfest.
> >
> > --
> > Best Wishes,
> > Ashutosh Bapat
>
> PFA rebased patches. Nothing changes in 0002, 0003 and 0004. 0001 is
> the squashed version of the latest patch set at
> https://www.postgresql.org/message-id/CAExHW5sCJX7696sF-OnugAiaXS=Ag95=-m1cSrjcmyYj8Pduuw@mail.gmail.com.

CFBot shows that the patch does not apply anymore as in [1]:
=== Applying patches on top of PostgreSQL commit ID
7014c9a4bba2d1b67d60687afb5b2091c1d07f73 ===
=== applying patch
./0001-Report-memory-used-for-planning-a-query-in--20231031.patch
...
patching file src/test/regress/expected/explain.out
Hunk #5 FAILED at 290.
Hunk #6 succeeded at 545 (offset 4 lines).
1 out of 6 hunks FAILED -- saving rejects to file
src/test/regress/expected/explain.out.rej
patching file src/tools/pgindent/typedefs.list
Hunk #1 succeeded at 1562 (offset 18 lines).

Please post an updated version for the same.

[1] - http://cfbot.cputube.org/patch_46_4564.log

Regards,
Vignesh



On Sat, Jan 27, 2024 at 8:26 AM vignesh C <vignesh21@gmail.com> wrote:
>
> On Tue, 31 Oct 2023 at 10:48, Ashutosh Bapat
> <ashutosh.bapat.oss@gmail.com> wrote:
> >
> > On Thu, Sep 14, 2023 at 4:39 PM Ashutosh Bapat
> > <ashutosh.bapat.oss@gmail.com> wrote:
> > >
> > > The patch set is thus
> > > 0001 - patch used to measure memory used during planning
> > >
> > > 0002 - Patch to free intermediate Relids computed by
> > > adjust_child_relids_multilevel(). I didn't test memory consumption for
> > > multi-level partitioning. But this is clear improvement. In that
> > > function we free the AppendInfos array which as many pointers long as
> > > the number of relids. So it doesn't make sense not to free the Relids
> > > which can be {largest relid}/8 bytes long at least.
> > >
> > > 0003 - patch to save and reuse commuted RestrictInfo. This patch by
> > > itself shows a small memory saving (3%) in the query below where the
> > > same clause is commuted twice. The query does not contain any
> > > partitioned tables.
> > > create table t2 (a int primary key, b int, c int);
> > > create index t2_a_b on t2(a, b);
> > > select * from t2 where 10 = a
> > > Memory used without patch: 13696 bytes
> > > Memory used with patch: 13264 bytes
> > >
> > > 0004 - Patch which implements the hash table of hash table described
> > > above and also code to avoid repeated RestrictInfo list translations.
> > >
> > > I will add this patchset to next commitfest.
> > >
> > > --
> > > Best Wishes,
> > > Ashutosh Bapat
> >
> > PFA rebased patches. Nothing changes in 0002, 0003 and 0004. 0001 is
> > the squashed version of the latest patch set at
> > https://www.postgresql.org/message-id/CAExHW5sCJX7696sF-OnugAiaXS=Ag95=-m1cSrjcmyYj8Pduuw@mail.gmail.com.
>
> CFBot shows that the patch does not apply anymore as in [1]:
> === Applying patches on top of PostgreSQL commit ID
> 7014c9a4bba2d1b67d60687afb5b2091c1d07f73 ===
> === applying patch
> ./0001-Report-memory-used-for-planning-a-query-in--20231031.patch
> ...
> patching file src/test/regress/expected/explain.out
> Hunk #5 FAILED at 290.
> Hunk #6 succeeded at 545 (offset 4 lines).
> 1 out of 6 hunks FAILED -- saving rejects to file
> src/test/regress/expected/explain.out.rej
> patching file src/tools/pgindent/typedefs.list
> Hunk #1 succeeded at 1562 (offset 18 lines).
>
> Please post an updated version for the same.
>
> [1] - http://cfbot.cputube.org/patch_46_4564.log
>
> Regards,
> Vignesh

Thanks Vignesh for the notification. PFA rebased patches. 0001 in
earlier patch-set is now removed.

--
Best Wishes,
Ashutosh Bapat

Attachment
Hi,

After taking a look at the patch optimizing SpecialJoinInfo allocations,
I decided to take a quick look at this one too. I don't have any
specific comments on the code yet, but it seems quite a bit more complex
than the other patch ... it's introducing a HTAB into the optimizer,
surely that has costs too.

I started by doing the same test as with the other patch, comparing
master to the two patches (applied independently and both). And I got
about this (in MB):

  tables    master     sjinfo    rinfo      both
  -----------------------------------------------
       2        37         36       34        33
       3       138        129      122       113
       4       421        376      363       318
       5      1495       1254     1172       931

Unlike the SpecialJoinInfo patch, I haven't seen any reduction in
planning time for this one.

The reduction in allocated memory is nice. I wonder what's allocating
the remaining memory, and what we'd have to do to reduce it further.

However, this is a somewhat extreme example - it's joining 5 tables,
each with 1000 partitions, using a partition-wise join. It reduces the
amount of memory, but the planning time is still quite high (and
essentially the same as without the patch). So it's not like it'd make
them significantly more practical ... do we have any other ideas/plans
how to improve that?

AFAIK we don't expect this to improve "regular" cases with modest number
of tables / partitions, etc. But could it make them slower in some case?


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



On 19/2/2024 06:05, Tomas Vondra wrote:
> However, this is a somewhat extreme example - it's joining 5 tables,
> each with 1000 partitions, using a partition-wise join. It reduces the
> amount of memory, but the planning time is still quite high (and
> essentially the same as without the patch). So it's not like it'd make
> them significantly more practical ... do we have any other ideas/plans
> how to improve that?
The planner principle of cleaning up all allocated structures after the 
optimization stage simplifies development and code. But, if we want to 
achieve horizontal scalability on many partitions, we should introduce a 
per-partition memory context and reset it in between. GEQO already has a 
short-lived memory context, making designing extensions a bit more 
challenging but nothing too painful.

-- 
regards,
Andrei Lepikhov
Postgres Professional




On Mon, Feb 19, 2024 at 4:35 AM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
>
> Hi,
>
> After taking a look at the patch optimizing SpecialJoinInfo allocations,
> I decided to take a quick look at this one too. I don't have any
> specific comments on the code yet, but it seems quite a bit more complex
> than the other patch ... it's introducing a HTAB into the optimizer,
> surely that has costs too.

Thanks for looking into this too.

>
> I started by doing the same test as with the other patch, comparing
> master to the two patches (applied independently and both). And I got
> about this (in MB):
>
>   tables    master     sjinfo    rinfo      both
>   -----------------------------------------------
>        2        37         36       34        33
>        3       138        129      122       113
>        4       421        376      363       318
>        5      1495       1254     1172       931
>
> Unlike the SpecialJoinInfo patch, I haven't seen any reduction in
> planning time for this one.

Yeah. That agreed with my observation as well.

>
> The reduction in allocated memory is nice. I wonder what's allocating
> the remaining memory, and we'd have to do to reduce it further.

Please see my reply on the SpecialJoinInfo thread. Other than the current
patches, we require a memory-efficient Relids implementation. I have
shared some ideas in the slides posted on the other thread, but
haven't spent time experimenting with any of them.

>
> However, this is a somewhat extreme example - it's joining 5 tables,
> each with 1000 partitions, using a partition-wise join. It reduces the
> amount of memory, but the planning time is still quite high (and
> essentially the same as without the patch). So it's not like it'd make
> them significantly more practical ... do we have any other ideas/plans
> how to improve that?

Yuya has been working on reducing planning time [1]. That work shows some
significant improvements in that area based on my experiments. But
those patches are complex and still WIP.

>
> AFAIK we don't expect this to improve "regular" cases with modest number
> of tables / partitions, etc. But could it make them slower in some case?
>

AFAIR, my experiments did not show any degradation in regular cases
with a modest number of tables/partitions. The variation in planning
time was within the usual run-to-run variation.

[1]
https://www.postgresql.org/message-id/flat/CAExHW5uVZ3E5RT9cXHaxQ_DEK7tasaMN%3DD6rPHcao5gcXanY5w%40mail.gmail.com#112b3e104e0f9e39eb007abe075aae20

--
Best Wishes,
Ashutosh Bapat




On Mon, Feb 19, 2024 at 5:17 AM Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> wrote:
On Mon, Feb 19, 2024 at 4:35 AM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
>
> Hi,
>
> After taking a look at the patch optimizing SpecialJoinInfo allocations,
> I decided to take a quick look at this one too. I don't have any
> specific comments on the code yet, but it seems quite a bit more complex
> than the other patch ... it's introducing a HTAB into the optimizer,
> surely that has costs too.

Thanks for looking into this too.

>
> I started by doing the same test as with the other patch, comparing
> master to the two patches (applied independently and both). And I got
> about this (in MB):
>
>   tables    master     sjinfo    rinfo      both
>   -----------------------------------------------
>        2        37         36       34        33
>        3       138        129      122       113
>        4       421        376      363       318
>        5      1495       1254     1172       931
>
> Unlike the SpecialJoinInfo patch, I haven't seen any reduction in
> planning time for this one.

Yeah. That agreed with my observation as well.

>
> The reduction in allocated memory is nice. I wonder what's allocating
> the remaining memory, and we'd have to do to reduce it further.

Please see reply to SpecialJoinInfo thread. Other that the current
patches, we require memory efficient Relids implementation. I have
shared some ideas in the slides I shared in the other thread, but
haven't spent time experimenting with any ideas there.

>
> However, this is a somewhat extreme example - it's joining 5 tables,
> each with 1000 partitions, using a partition-wise join. It reduces the
> amount of memory, but the planning time is still quite high (and
> essentially the same as without the patch). So it's not like it'd make
> them significantly more practical ... do we have any other ideas/plans
> how to improve that?

Yuya has been working on reducing planning time [1]. Has some
significant improvements in that area based on my experiments. But
those patches are complex and still WIP.

>
> AFAIK we don't expect this to improve "regular" cases with modest number
> of tables / partitions, etc. But could it make them slower in some case?
>

AFAIR, my experiments did not show any degradation in regular cases
with modest number of tables/partitions. The variation in planning
time was with the usual planning time variations.

Documenting some comments from today's patch review session:
1. Instead of a nested hash table, it might be better to use a flat hash table to save more memory.
2. The new comm_rinfo member in RestrictInfo may have problems when copying a RestrictInfo or translating it. Instead, commuted versions may be tracked outside RestrictInfo.

Combining the above two, it feels like we need a single hash table with (commuted, rinfo_serial, relids) as the key to store all the translations of a RestrictInfo.
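
A speculative sketch of what such a flat table's key and support functions
might look like (all names are invented; the point is that the Relids part
of the key needs bms_hash_value()/bms_equal() style hashing and comparison,
so the table needs custom hash and match functions rather than HASH_BLOBS):

#include "postgres.h"
#include "common/hashfn.h"
#include "nodes/bitmapset.h"
#include "nodes/pathnodes.h"
#include "utils/hsearch.h"

typedef struct ChildRinfoKey
{
    bool   commuted;        /* commuted version of the clause? */
    int    rinfo_serial;    /* identifies the parent clause */
    Relids relids;          /* required_relids of the translation */
} ChildRinfoKey;

typedef struct ChildRinfoEntry
{
    ChildRinfoKey key;      /* must be first for dynahash */
    RestrictInfo *rinfo;    /* the translated (and maybe commuted) clause */
} ChildRinfoEntry;

static uint32
child_rinfo_hash_fn(const void *key, Size keysize)
{
    const ChildRinfoKey *k = (const ChildRinfoKey *) key;
    uint32      h;

    h = hash_combine((uint32) k->commuted, (uint32) k->rinfo_serial);
    return hash_combine(h, bms_hash_value(k->relids));
}

static int
child_rinfo_match_fn(const void *key1, const void *key2, Size keysize)
{
    const ChildRinfoKey *k1 = (const ChildRinfoKey *) key1;
    const ChildRinfoKey *k2 = (const ChildRinfoKey *) key2;

    if (k1->commuted == k2->commuted &&
        k1->rinfo_serial == k2->rinfo_serial &&
        bms_equal(k1->relids, k2->relids))
        return 0;               /* dynahash expects 0 for "equal" */
    return 1;
}

static HTAB *
make_child_rinfo_hash(MemoryContext cxt)
{
    HASHCTL     ctl;

    memset(&ctl, 0, sizeof(ctl));
    ctl.keysize = sizeof(ChildRinfoKey);
    ctl.entrysize = sizeof(ChildRinfoEntry);
    ctl.hash = child_rinfo_hash_fn;
    ctl.match = child_rinfo_match_fn;
    ctl.hcxt = cxt;

    return hash_create("child RestrictInfo translations", 64, &ctl,
                       HASH_ELEM | HASH_FUNCTION | HASH_COMPARE | HASH_CONTEXT);
}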

--
Best Wishes,
Ashutosh Bapat
On Mon, Aug 19, 2024 at 6:43 PM Ashutosh Bapat
<ashutosh.bapat.oss@gmail.com> wrote:
>
> Sorry for the delay in my response.
>
> On Wed, May 29, 2024 at 9:34 AM Ashutosh Bapat
> <ashutosh.bapat.oss@gmail.com> wrote:
> >
> > Documenting some comments from todays' patch review session
>
> I forgot to mention back then that both of the suggestions below came
> from Tom Lane.
>
> > 1. Instead of a nested hash table, it might be better to use a flat hash table to save more memory.
>
> Done. It indeed saves memory without impacting planning time.
>
> > 2. new comm_rinfo member in RestrictInfo may have problems when copying RestrictInfo or translating it. Instead
> > commuted versions may be tracked outside RestrictInfo
>
> After commit 767c598954bbf72e0535f667e2e0667765604b2a,
> repameterization of parent paths for child relation happen at the time
> of creating the plan. This reduces the number of child paths produced
> by reparameterization and also reduces the number of RestrictInfos
> that get translated during reparameterization. During
> reparameterization commuted parent RestrictInfos are required to be
> translated to child RestrictInfos. Before the commit, this led to
> translating the same commuted parent RestrictInfo multiple times.
> After the commit, only one path gets reparameterized for a given
> parent and child pair. Hence we do not produce multiple copies of the
> same commuted child RestrictInfo. Hence we don't need to keep track of
> commuted child RestrictInfos anymore. Removed that portion of code
> from the patches.
>
> I made detailed memory consumption measurements with this patch for
> number of partitions changed from 0 (unpartitioned) to 1000 and for 2
> to 5-way joins. They are available in the spreadsheet at [1]. The raw
> measurement data is in the first sheet named "pwj_mem_measurements raw
> numbers". The averages over multiple runs are in second sheet named
> "avg_numbers". Rest of the sheet represent the averages in more
> consumable manner. Please note that the averages make sense only for
> planning time since the memory consumption remains same with every
> run. Also note that EXPLAIN now reports planning memory consumption in
> kB. Any changes to memory consumption below 1kB are not reported and
> hence not noticed. Here are some observations.
>
> 1. When partitionwise join is not enabled, no changes to planner's
> memory consumption are observed. See sheet named "pwj disabled,
> planning memory".
>
> 2. When partitionwise join is enabled, upto 17% (206MB) memory is
> saved by the patch in case of 5-way self-join with 1000 partitions.
> This is the maximum memory saving observed. The amount of memory saved
> increases with the number of joins and number of partitions. See sheet
> with name "pwj enabled, planning memory"
>
> 3. After commit 767c598954bbf72e0535f667e2e0667765604b2a, we do not
> translate a parent RestrictInfo multiple times for the same
> parent-child pair in case of  a *2-way partitionwise join*. But we
> still consume memory in saving the child RestrictInfo in the hash
> table. Hence in case of 2-way join we see increased memory consumption
> with the patch compared to master. The memory consumption increases by
> 13kb, 23kB, 76kB and 146kB for 10, 100, 500, 1000 partitions
> respectively. This increase is smaller compared to the overall memory
> saving. In order to avoid this memory increase, we will need to avoid
> using hash table for 2-way join. We will need to know whether there
> will be more than one partitionwise join before translating the
> RestrictInfos for the first partitionwise join. This is hard to
> achieve in all the cases since the decision to use partitionwise join
> happens at the time of creating paths for a given join relation, which
> itself is computed on the fly. We may choose some heuristics which
> take into account the number of partitioned tables in the query, their
> partition schemes, and the quals in the query to decide whether or not
> to track the translated child RestrictInfos. But that will make the
> code more complex, but more importantly the heuristics may not be able
> to keep up if we start using partitionwise join as an optimization
> strategy for more cases (e.g. asymmetric partitionwise join [2]). The
> attached patch looks like a good tradeoff to me. But other opinions
> might vary. Suggestions are welcome.
>
> 4. There is no noticeable change in the planning time. I ran the same
> experiment multiple times. The planning time variations from each
> experiment do not show any noticeable pattern suggesting increase or
> decrease in the planning time with the patch.
>
> A note about the code: I have added all the structures and functions
> dealing with the RestrictInfo hash table at the end of restrictinfo.c.
> I have not come across a C file in PostgreSQL code base where private
> structures are defined in the middle the file; usually they are
> defined at the beginning of the file. But I have chosen it that way
> here since it makes it easy to document the hash table and the
> functions at one place at the beginning of this code section. I am
> open to suggestions which make the documentation easy while placing
> the structures at the beginning of the file.
>
> [1] https://docs.google.com/spreadsheets/d/1CEjRWZ02vuR8fSwhYaNugewtX8f0kIm5pLoRA95f3s8/edit?usp=sharing
> [2] https://www.postgresql.org/message-id/flat/CAOP8fzaVL_2SCJayLL9kj5pCA46PJOXXjuei6-3aFUV45j4LJQ%40mail.gmail.com
>

Noticed that Tomas's email address has changed. Adding his new email address.

--
Best Wishes,
Ashutosh Bapat



> On Thu, Oct 10, 2024 at 05:36:10PM GMT, Ashutosh Bapat wrote:
>
> 3. With just patch 0001 applied, planning time usually shows
> degradation (column Q and X in planning time sheets) with or without
> PWJ enabled. I first thought that it might be because of the increased
> size of PlannerInfo. We had seen a similar phenomenon when adding a
> new member to WindowAggState [2]. Hence I introduced patch 0002 which
> moves two fields around to not increase the size of structure. But
> that doesn't fix the regression in the planning time (columns R and
> Y). Apart from increasing the PlannerInfo size and may be object file
> size, 0002 does not have any other impact. But the regression seen
> with just that patch is more than what we saw in [2]. More investigation
> is required to decide whether this regression is real or not and, if
> real, its root cause. Looking at the numbers, it seems that this
> regression is causing the planning time regression in the rest of the
> patches. If we fix the regression caused by 0001, we should not see much
> regression in the rest of the patches. I am looking for some guidance in
> investigating this regression.

Hi,

I've tried to reproduce some subset of those results, in case I would
be able to notice anything useful. Strangely enough, I wasn't able to
get much of a boost in planning time, e.g. with the first 4 patches, 100
partitions and 5 joins -- the results you've posted show about
16% in that case, whereas I'm getting only a couple of percent. Probably
I'm doing something differently, but it has turned out to be hard to
reconstruct (based only on this thread) how exactly you benchmarked
the patch -- could you maybe summarize the benchmark in a reproducible
way?

From what I understand you were testing against an empty partitioned table. Here
is what I was doing:

    create table t1p (c1 int) partition by list(c1);
    select format('create table %I partition of t1p for values in (%s)',
        't1p' || i, i) from generate_series(1, 100) i; \gexec

    do $x$
    declare
        i record;
        plan float[];
        plan_line text;
    begin
      for i in select * from generate_series(1, 1000) i loop
        for plan_line in execute format($y$
            explain analyze
                select * from t1p t1, t1p t2, t1p t3, t1p t4, t1p t5
                where t2.c1 =
                    t1.c1 and t3.c1 = t2.c1 and t4.c1 = t3.c1 and t5.c1 = t4.c1
            $y$) loop
            if plan_line like '%Planning Time%' then
                plan := array_append(plan, substring(plan_line from '\d+.\d+')::float);
            end if;
        end loop;
      end loop;

      -- skip the first record as prewarming
      raise warning 'avg: %',
        (select avg(v) from unnest(plan[2:]) v);
      raise warning 'median: %',
        (select percentile_cont(0.5) within group(order by v)
            from unnest(plan[2:]) v);
    end;
    $x$;

As a side note, may I ask you to attach benchmark results as files, rather than a
link to a Google spreadsheet? It feels nice to have something static to work
with.



On 2024-Nov-25, Ashutosh Bapat wrote:

> Hmm, I am doing something similar to what you are doing. Here are my
> scripts.  setup.sql - creates partitioned table, and functions, tables
> used to run the benchmark benchmark.sh - creates queries with all
> combinations of enable_partitionwise_join, number of partitions, joins
> etc. and runs EXPLAIN on each query recording the results in a table.

I was curious about this still being alive and uncommitted, so I
wondered if Dmitry was on to something about this patch not improving
things.  So I tried to rerun Ashutosh's benchmark (of course, on a build
with no C assertions, otherwise the numbers are meaningless).  First,
the patches still apply to current master.  Turns out that they improve
planning times not insignificantly in the majority of cases (not all).
This is on my laptop and I didn't do anything in particular to keep it
stable, though.

I don't know about this modern googol sheets thing you're talking about,
so I use \crosstabview like my cavemen forefathers did.  After running
benchmark.sh (note: it needs "-X" on the psql lines, otherwise it gets
all confused by funny stuff in my .psqlrc) to get the SQL to run, I
obtain a summary table with this psql query:

select code_tag, num_parts, num_joins,
  format ('n=%s avg=%s dev=%s', count(planning_time_ms) filter (where pwj),
          (avg(planning_time_ms) filter (where pwj))::numeric(6, 2),
          (stddev(planning_time_ms) filter (where pwj))::numeric(6,2))
from msmts where code_tag = 'master'
group by code_tag, num_parts, num_joins
order by 1, 2, 3 \crosstabview 2 3 4

Here's the tables I got.

PWJ on, master:
 num_parts │             2             │             3              │              4              │              5
───────────┼───────────────────────────┼────────────────────────────┼─────────────────────────────┼─────────────────────────────
         0 │ n=10 avg=0.05 dev=0.00    │ n=10 avg=0.16 dev=0.04     │ n=10 avg=0.38 dev=0.10      │ n=10 avg=0.95 dev=0.14
        10 │ n=10 avg=0.41 dev=0.02    │ n=10 avg=1.25 dev=0.09     │ n=10 avg=4.93 dev=0.30      │ n=10 avg=12.12 dev=0.69
       100 │ n=10 avg=4.68 dev=0.40    │ n=10 avg=21.34 dev=1.04    │ n=10 avg=65.09 dev=1.91     │ n=10 avg=206.87 dev=3.87
       500 │ n=10 avg=55.11 dev=1.43   │ n=10 avg=240.97 dev=7.90   │ n=10 avg=834.72 dev=35.67   │ n=10 avg=2534.78 dev=107.28
      1000 │ n=10 avg=242.40 dev=21.09 │ n=10 avg=1085.65 dev=38.11 │ n=10 avg=3161.00 dev=151.04 │ n=10 avg=9634.34 dev=635.57

PWJ on, all patches:
 num_parts │             2             │             3              │              4              │              5
───────────┼───────────────────────────┼────────────────────────────┼─────────────────────────────┼─────────────────────────────
         0 │ n=10 avg=0.05 dev=0.00    │ n=10 avg=0.12 dev=0.01     │ n=10 avg=0.34 dev=0.01      │ n=10 avg=0.91 dev=0.02
        10 │ n=10 avg=0.37 dev=0.01    │ n=10 avg=1.17 dev=0.07     │ n=10 avg=4.09 dev=0.25      │ n=10 avg=10.31 dev=0.38
       100 │ n=10 avg=4.62 dev=0.14    │ n=10 avg=17.17 dev=0.45    │ n=10 avg=54.05 dev=0.98     │ n=10 avg=178.05 dev=2.69
       500 │ n=10 avg=61.32 dev=1.91   │ n=10 avg=229.54 dev=15.82  │ n=10 avg=701.33 dev=34.16   │ n=10 avg=2176.00 dev=84.28
      1000 │ n=10 avg=195.74 dev=5.73  │ n=10 avg=789.49 dev=16.44  │ n=10 avg=2786.55 dev=254.03 │ n=10 avg=9177.05 dev=467.33

PWJ off, master:
 num_parts │             2             │             3              │              4              │              5
───────────┼───────────────────────────┼────────────────────────────┼─────────────────────────────┼─────────────────────────────
         0 │ n=10 avg=0.06 dev=0.02    │ n=10 avg=0.16 dev=0.04     │ n=10 avg=0.39 dev=0.07      │ n=10 avg=1.08 dev=0.17
        10 │ n=10 avg=0.27 dev=0.03    │ n=10 avg=0.54 dev=0.01     │ n=10 avg=1.05 dev=0.03      │ n=10 avg=2.09 dev=0.07
       100 │ n=10 avg=5.17 dev=2.45    │ n=10 avg=8.96 dev=0.14     │ n=10 avg=17.25 dev=0.29     │ n=10 avg=36.11 dev=1.06
       500 │ n=10 avg=46.82 dev=1.84   │ n=10 avg=149.06 dev=2.79   │ n=10 avg=396.95 dev=26.15   │ n=10 avg=912.93 dev=31.78
      1000 │ n=10 avg=219.86 dev=5.21  │ n=10 avg=697.27 dev=14.96  │ n=10 avg=1925.81 dev=65.78  │ n=10 avg=4857.81 dev=248.71

PWJ off, allpatches:
 num_parts │             2             │             3              │              4              │              5
───────────┼───────────────────────────┼────────────────────────────┼─────────────────────────────┼─────────────────────────────
         0 │ n=10 avg=0.06 dev=0.01    │ n=10 avg=0.13 dev=0.02     │ n=10 avg=0.34 dev=0.02      │ n=10 avg=0.95 dev=0.06
        10 │ n=10 avg=0.25 dev=0.02    │ n=10 avg=0.52 dev=0.01     │ n=10 avg=0.96 dev=0.01      │ n=10 avg=1.86 dev=0.01
       100 │ n=10 avg=5.43 dev=2.37    │ n=10 avg=7.30 dev=0.16     │ n=10 avg=12.93 dev=0.35     │ n=10 avg=24.56 dev=0.49
       500 │ n=10 avg=50.10 dev=2.35   │ n=10 avg=156.04 dev=11.05  │ n=10 avg=332.48 dev=17.44   │ n=10 avg=711.22 dev=21.44
      1000 │ n=10 avg=174.02 dev=15.26 │ n=10 avg=567.23 dev=8.81   │ n=10 avg=1480.75 dev=45.04  │ n=10 avg=3578.19 dev=240.20


So it looks to me like for a high number of partitions and joins, this
wins hands down in terms of planning time.  For some of the 2/3 joins
and 100/500 partitions, it loses.


As for planner memory, which this was supposed to improve, I don't find
any significant improvement, except for 1000 partitions and 5 joins
(where it goes from 1071991 to 874480 kilobytes); IMO it's not worth
framing this improvement from that point of view, because it doesn't
seem compelling, at least to me.

-- 
Álvaro Herrera               48°01'N 7°57'E  —  https://www.EnterpriseDB.com/
"Hay dos momentos en la vida de un hombre en los que no debería
especular: cuando puede permitírselo y cuando no puede" (Mark Twain)



On Fri, Jan 31, 2025 at 5:46 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
>
> On 2025-Jan-31, Alvaro Herrera wrote:
>
> > So I tried to rerun Ashutosh's benchmark (of course, on a build
> > with no C assertions, otherwise the numbers are meaningless).  First,
> > the patches still apply to current master.
>
> (BTW I just realized that I applied patches 0001-0004, omitting 0005).

That's ok. The last patch shows a minor regression wrt 0004. It will
be dropped from the patchset anyway.

--
Best Wishes,
Ashutosh Bapat



On Fri, Jan 31, 2025 at 5:41 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
>
> On 2024-Nov-25, Ashutosh Bapat wrote:
>
> > Hmm, I am doing something similar to what you are doing. Here are my
> > scripts.  setup.sql - creates partitioned table, and functions, tables
> > used to run the benchmark benchmark.sh - creates queries with all
> > combinations of enable_partitionwise_join, number of partitions, joins
> > etc. and runs EXPLAIN on each query recording the results in a table.
>
> I was curious about this still being alive and uncommitted, so I
> wondered if Dmitry was on to something about this patch not improving
> things.  So I tried to rerun Ashutosh's benchmark (of course, on a build
> with no C assertions, otherwise the numbers are meaningless).  First,
> the patches still apply to current master.  Turns on that they improve
> planning times not insignificantly in the majority of cases (not all).
> This is on my laptop and I didn't do anything in particular to keep it
> stable, though.
>
> I don't know about this modern googol sheets thing you're talking about,
> so I use \crosstabview like my cavemen forefathers did.  After running
> benchmark.sh

Thanks for suggesting \crosstabview. The summary table looks good. But
it doesn't tell the difference (absolute or %), so the reader has to
do the maths themselves. Maybe I could improve the query itself to
report the difference.

> (note: it needs "-X" on the psql lines, otherwise it gets
> all confused by funny stuff in my .psqlrc) to get the SQL to run, I
> obtain a summary table with this psql query:

Will incorporate -X

>
> select code_tag, num_parts, num_joins,
>   format ('n=%s avg=%s dev=%s', count(planning_time_ms) filter (where pwj),
>           (avg(planning_time_ms) filter (where pwj))::numeric(6, 2),
>           (stddev(planning_time_ms) filter (where pwj))::numeric(6,2))
> from msmts where code_tag = 'master'
> group by code_tag, num_parts, num_joins
> order by 1, 2, 3 \crosstabview 2 3 4
>
> Here's the tables I got.
>
> PWJ on, master:
>  num_parts │             2             │             3              │              4              │              5
>
───────────┼───────────────────────────┼────────────────────────────┼─────────────────────────────┼─────────────────────────────
>          0 │ n=10 avg=0.05 dev=0.00    │ n=10 avg=0.16 dev=0.04     │ n=10 avg=0.38 dev=0.10      │ n=10 avg=0.95
dev=0.14
>         10 │ n=10 avg=0.41 dev=0.02    │ n=10 avg=1.25 dev=0.09     │ n=10 avg=4.93 dev=0.30      │ n=10 avg=12.12
dev=0.69
>        100 │ n=10 avg=4.68 dev=0.40    │ n=10 avg=21.34 dev=1.04    │ n=10 avg=65.09 dev=1.91     │ n=10 avg=206.87
dev=3.87
>        500 │ n=10 avg=55.11 dev=1.43   │ n=10 avg=240.97 dev=7.90   │ n=10 avg=834.72 dev=35.67   │ n=10 avg=2534.78
dev=107.28
>       1000 │ n=10 avg=242.40 dev=21.09 │ n=10 avg=1085.65 dev=38.11 │ n=10 avg=3161.00 dev=151.04 │ n=10 avg=9634.34
dev=635.57
>
> PWJ on, all patches:
>  num_parts │            2             │             3             │              4              │              5
>
───────────┼──────────────────────────┼───────────────────────────┼─────────────────────────────┼─────────────────────────────
>          0 │ n=10 avg=0.05 dev=0.00   │ n=10 avg=0.12 dev=0.01    │ n=10 avg=0.34 dev=0.01      │ n=10 avg=0.91
dev=0.02
>         10 │ n=10 avg=0.37 dev=0.01   │ n=10 avg=1.17 dev=0.07    │ n=10 avg=4.09 dev=0.25      │ n=10 avg=10.31
dev=0.38
>        100 │ n=10 avg=4.62 dev=0.14   │ n=10 avg=17.17 dev=0.45   │ n=10 avg=54.05 dev=0.98     │ n=10 avg=178.05
dev=2.69
>        500 │ n=10 avg=61.32 dev=1.91  │ n=10 avg=229.54 dev=15.82 │ n=10 avg=701.33 dev=34.16   │ n=10 avg=2176.00
dev=84.28
>       1000 │ n=10 avg=195.74 dev=5.73 │ n=10 avg=789.49 dev=16.44 │ n=10 avg=2786.55 dev=254.03 │ n=10 avg=9177.05
dev=467.33
>
>
> PWJ off, master:
>  num_parts │            2             │             3             │             4              │              5
> ───────────┼──────────────────────────┼───────────────────────────┼────────────────────────────┼─────────────────────────────
>          0 │ n=10 avg=0.06 dev=0.02   │ n=10 avg=0.16 dev=0.04    │ n=10 avg=0.39 dev=0.07     │ n=10 avg=1.08 dev=0.17
>         10 │ n=10 avg=0.27 dev=0.03   │ n=10 avg=0.54 dev=0.01    │ n=10 avg=1.05 dev=0.03     │ n=10 avg=2.09 dev=0.07
>        100 │ n=10 avg=5.17 dev=2.45   │ n=10 avg=8.96 dev=0.14    │ n=10 avg=17.25 dev=0.29    │ n=10 avg=36.11 dev=1.06
>        500 │ n=10 avg=46.82 dev=1.84  │ n=10 avg=149.06 dev=2.79  │ n=10 avg=396.95 dev=26.15  │ n=10 avg=912.93 dev=31.78
>       1000 │ n=10 avg=219.86 dev=5.21 │ n=10 avg=697.27 dev=14.96 │ n=10 avg=1925.81 dev=65.78 │ n=10 avg=4857.81 dev=248.71
>
>
> PWJ off, allpatches:
>  num_parts │             2             │             3             │             4              │              5
> ───────────┼───────────────────────────┼───────────────────────────┼────────────────────────────┼─────────────────────────────
>          0 │ n=10 avg=0.06 dev=0.01    │ n=10 avg=0.13 dev=0.02    │ n=10 avg=0.34 dev=0.02     │ n=10 avg=0.95 dev=0.06
>         10 │ n=10 avg=0.25 dev=0.02    │ n=10 avg=0.52 dev=0.01    │ n=10 avg=0.96 dev=0.01     │ n=10 avg=1.86 dev=0.01
>        100 │ n=10 avg=5.43 dev=2.37    │ n=10 avg=7.30 dev=0.16    │ n=10 avg=12.93 dev=0.35    │ n=10 avg=24.56 dev=0.49
>        500 │ n=10 avg=50.10 dev=2.35   │ n=10 avg=156.04 dev=11.05 │ n=10 avg=332.48 dev=17.44  │ n=10 avg=711.22 dev=21.44
>       1000 │ n=10 avg=174.02 dev=15.26 │ n=10 avg=567.23 dev=8.81  │ n=10 avg=1480.75 dev=45.04 │ n=10 avg=3578.19 dev=240.20
>
>
> So it looks to me like for high number of partitions and joins, this
> wins hands down in terms of planning time.  For some of the 2/3 joins
> and 100/500 partitions, it loses.
>

The combination for which the planning time regresses is not fixed -
it shifts every time I run the benchmark. But I see a regression with
one combination or the other, so I haven't been able to decide whether
it's a real regression or not. Planning time for a small number of
joins varies a lot from run to run.

>
> As for planner memory, which this was supposed to improve, I don't find
> any significant improvement, except for 1000 partitions and 5 joins
> (where it goes from 1071991 to 874480 kilobytes); IMO it's not worth
> framing this improvement from that point of view, because it doesn't
> seem compelling, at least to me.

If we are not interested in saving memory, there is a simpler way to
improve planning time: add a hash table per equivalence class to
store the derived clauses, instead of a linked list, when the number
of derived clauses is higher than a threshold (say 32, the same as the
threshold for join_rel_list). Maybe that approach will yield stable
planning time.
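
To make that concrete, here is a rough sketch of what the lookup side
could look like. ec_derives_list and ec_derives_hash are just the names
I have in mind for the two structures, and ec_build_derives_hash() /
ec_derives_hash_lookup() are hypothetical helpers, so treat this as an
illustration of the idea rather than the patch itself:

#define EC_DERIVES_HASH_THRESHOLD 32

/*
 * Sketch only: find an already-derived clause joining leftem and
 * rightem.  Small ECs keep using a linear scan of ec_derives_list;
 * larger ones lazily build ec_derives_hash and probe it instead.
 */
static RestrictInfo *
ec_search_derived_clause(EquivalenceClass *ec,
                         EquivalenceMember *leftem,
                         EquivalenceMember *rightem)
{
    ListCell   *lc;

    if (list_length(ec->ec_derives_list) < EC_DERIVES_HASH_THRESHOLD)
    {
        /* cheap case: walk the list, as the code does today */
        foreach(lc, ec->ec_derives_list)
        {
            RestrictInfo *rinfo = (RestrictInfo *) lfirst(lc);

            if (rinfo->left_em == leftem && rinfo->right_em == rightem)
                return rinfo;
        }
        return NULL;
    }

    if (ec->ec_derives_hash == NULL)
        ec_build_derives_hash(ec);   /* hypothetical: index clauses by (left_em, right_em) */

    return ec_derives_hash_lookup(ec->ec_derives_hash, leftem, rightem);
}

Adding a new derived clause would then append to the list and, if the
hash table already exists, insert into it as well.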

-- 
Best Wishes,
Ashutosh Bapat

On Tue, Feb 4, 2025 at 4:07 PM Ashutosh Bapat
<ashutosh.bapat.oss@gmail.com> wrote:
>
> If we are not interested in saving memory, there is a simpler way to
> improve planning time by adding a hash table per equivalence class to
> store the derived clauses, instead of a linked list, when the number
> of derived clauses is higher than a threshold (say 32 same as the
> threshold for join_rel_list. Maybe that approach will yield stable
> planning time.
>

PFA patchset with the conflict with latest HEAD resolved. I have dropped
0005 from the previous patchset since it didn't show any significant
performance difference. Other than these two things, the patchset is
the same as the previous one.


--
Best Wishes,
Ashutosh Bapat

Attachment
> On Mon, Nov 25, 2024 at 11:20:05AM GMT, Ashutosh Bapat wrote:
> > I've tried to reproduce some subset of those results, in case if I would
> > be able to notice anything useful. Strangely enough, I wasn't able to
> > get much boost in planning time e.g. with 4 first patches, 100
> > partitions and 5 joins -- the results you've posted are showing about
> > 16% in that case, where I'm getting only a couple of percents. Probably
> > I'm doing something differently, but it's turned out to be hard to
> > reconstruct (based only on this thread) how did you exactly benchmark
> > the patch -- could you maybe summarize the benchmark in a reproducible
> > way?
>
> Hmm, I am doing something similar to what you are doing. Here are my scripts.
> setup.sql - creates partitioned table, and functions, tables used to
> run the benchmark
> benchmark.sh - creates queries with all combinations of
> enable_partitionwise_join, number of partitions, joins etc. and runs
> EXPLAIN on each query recording the results in a table.
> run_bm_on_commits.sh - runs setup.sql once, then runs benchmark.sh on
> each commit (using git rebase) and finally outputs the average numbers
> to be fed to the "aggregate numbers" sheet.

Just FYI, I've finally found time to figure out why I get slightly
different results. It turns out I was running tests against a
partitioned table without a primary key, which obviously affects the
planner, making planning time shorter and reducing the delta between the
patched version and the main branch.

But of course a partitioned table without a pk makes little sense, so I
guess those results are not very relevant. I've done another round with
a pk, and got results similar to yours.



Hi,

On Thu, Feb 20, 2025 at 5:28 PM Ashutosh Bapat
<ashutosh.bapat.oss@gmail.com> wrote:
>
> On Tue, Feb 4, 2025 at 4:07 PM Ashutosh Bapat
> <ashutosh.bapat.oss@gmail.com> wrote:
> >
> > If we are not interested in saving memory, there is a simpler way to
> > improve planning time by adding a hash table per equivalence class to
> > store the derived clauses, instead of a linked list, when the number
> > of derived clauses is higher than a threshold (say 32 same as the
> > threshold for join_rel_list. Maybe that approach will yield stable
> > planning time.
> >

I implemented the above idea in the attached patches. I also added the
following query, inspired by Alvaro's query, to summarise the
results.
with master_avgs as
  (select code_tag, num_parts, num_joins, pwj,
          avg(planning_time_ms) avg_pt, stddev(planning_time_ms) stddev_pt
     from msmts where code_tag = 'master'
    group by code_tag, num_parts, num_joins, pwj),
patched_avgs as
  (select code_tag, num_parts, num_joins, pwj,
          avg(planning_time_ms) avg_pt, stddev(planning_time_ms) stddev_pt
     from msmts where code_tag = 'patched'
    group by code_tag, num_parts, num_joins, pwj)
select num_parts,
       num_joins,
       format('s=%s%% md=%s%% pd=%s%%',
              ((m.avg_pt - p.avg_pt)/m.avg_pt * 100)::numeric(6, 2),
              (m.stddev_pt/m.avg_pt * 100)::numeric(6, 2),
              (p.stddev_pt/p.avg_pt * 100)::numeric(6, 2))
  from master_avgs m join patched_avgs p using (num_parts, num_joins, pwj)
 where not pwj
 order by 1, 2, 3;
\crosstabview 1 2 3

"not pwj" in the WHERE clause should be changed to "pwj" to get results
with enable_partitionwise_join = true.

With the attached patches, I observe the following results.

With PWJ disabled
 num_parts |               2               |              3               |             4              |              5
-----------+-------------------------------+------------------------------+----------------------------+-----------------------------
         0 | s=-4.44% md=17.91% pd=23.05%  | s=-0.83% md=11.10% pd=19.58% | s=0.87% md=4.04% pd=7.91%  | s=-35.24% md=7.63% pd=9.69%
        10 | s=30.13% md=118.18% pd=37.44% | s=-3.49% md=0.58% pd=0.49%   | s=-0.83% md=0.29% pd=0.35% | s=-0.24% md=0.35% pd=0.32%
       100 | s=1.94% md=13.19% pd=4.08%    | s=-0.27% md=0.18% pd=0.44%   | s=7.04% md=3.05% pd=3.11%  | s=12.75% md=1.69% pd=0.81%
       500 | s=4.39% md=1.71% pd=1.33%     | s=10.17% md=1.28% pd=1.90%   | s=23.04% md=0.24% pd=0.58% | s=30.87% md=0.30% pd=1.11%
      1000 | s=4.27% md=1.21% pd=1.97%     | s=13.97% md=0.44% pd=0.79%   | s=24.05% md=0.63% pd=1.02% | s=30.77% md=0.77% pd=0.17%


Each cell is a triple (s, md, pd) where s is the improvement in planning
time with the patches, in % as compared to master (the higher the
better), md is the standard deviation as % of the average planning time
on master, and pd is the standard deviation as % of the average planning
time with the patches.

With PWJ enabled
 num_parts |              2               |              3               |              4              |              5
-----------+------------------------------+------------------------------+-----------------------------+------------------------------
         0 | s=-94.25% md=6.98% pd=56.03% | s=44.10% md=141.13% pd=9.32% | s=42.71% md=46.00% pd=6.55% | s=-26.12% md=6.72% pd=15.20%
        10 | s=-25.89% md=4.29% pd=63.75% | s=-1.34% md=3.15% pd=3.26%   | s=0.31% md=4.13% pd=4.34%   | s=-1.34% md=3.10% pd=6.73%
       100 | s=-2.83% md=0.94% pd=1.31%   | s=-2.17% md=4.57% pd=4.41%   | s=0.98% md=1.59% pd=1.81%   | s=1.87% md=1.10% pd=0.79%
       500 | s=1.57% md=3.01% pd=1.70%    | s=6.99% md=1.58% pd=1.68%    | s=11.11% md=0.24% pd=0.62%  | s=11.65% md=0.18% pd=0.90%
      1000 | s=3.59% md=0.98% pd=1.78%    | s=10.83% md=0.88% pd=0.46%   | s=15.62% md=0.46% pd=0.13%  | s=16.38% md=0.63% pd=0.29%

The same numbers measured for the previous set of patches [1], which
improve both memory consumption and planning time:

With PWJ disabled
 num_parts |               2               |              3               |              4               |               5
-----------+-------------------------------+------------------------------+------------------------------+--------------------------------
         0 | s=4.68% md=18.17% pd=22.09%   | s=-2.54% md=12.00% pd=13.81% | s=-2.02% md=3.84% pd=4.43%   | s=-69.14% md=11.06% pd=126.87%
        10 | s=-24.85% md=20.42% pd=35.69% | s=-4.31% md=0.73% pd=1.53%   | s=-14.97% md=0.32% pd=31.90% | s=-0.57% md=0.79% pd=0.50%
       100 | s=0.27% md=4.69% pd=1.55%     | s=4.16% md=0.29% pd=0.18%    | s=11.76% md=0.85% pd=0.49%   | s=15.76% md=1.64% pd=2.32%
       500 | s=0.54% md=1.88% pd=1.81%     | s=9.36% md=1.17% pd=0.87%    | s=21.45% md=0.74% pd=0.88%   | s=30.47% md=0.17% pd=1.17%
      1000 | s=3.22% md=1.36% pd=0.99%     | s=14.74% md=0.86% pd=0.44%   | s=24.50% md=0.36% pd=0.31%   | s=27.97% md=0.27% pd=0.25%

With PWJ enabled
 num_parts |              2              |             3              |             4              |              5
-----------+-----------------------------+----------------------------+----------------------------+-----------------------------
         0 | s=11.07% md=19.28% pd=8.70% | s=-1.18% md=5.88% pd=4.31% | s=-2.25% md=8.42% pd=3.77% | s=25.07% md=11.48% pd=3.87%
        10 | s=-9.07% md=2.65% pd=14.58% | s=0.55% md=3.10% pd=3.41%  | s=3.89% md=3.94% pd=3.79%  | s=7.25% md=2.87% pd=3.03%
       100 | s=-4.53% md=0.49% pd=8.53%  | s=2.24% md=4.24% pd=3.96%  | s=6.70% md=1.30% pd=2.08%  | s=9.09% md=1.39% pd=1.50%
       500 | s=-1.65% md=1.59% pd=1.44%  | s=6.31% md=0.89% pd=1.11%  | s=12.72% md=0.20% pd=0.29% | s=15.02% md=0.28% pd=0.83%
      1000 | s=1.53% md=1.01% pd=1.66%   | s=11.80% md=0.66% pd=0.71% | s=16.23% md=0.58% pd=0.18% | s=17.16% md=0.67% pd=0.68%

There are a few things to notice:
1. There is not much difference in the planning time improvement between
the two patchsets. But the patchset attached to [1] improves memory
consumption as well, so it looks more attractive.
2. The performance regressions usually coincide with a higher standard
deviation. This indicates that the performance gains or losses seen at
lower numbers of partitions and joins are not real and can probably be
ignored. I have run the script multiple times, but one combination or
another of lower numbers of partitions and joins shows a higher
deviation and thus unstable results. I have not been able to find a way
in which all the combinations show stable results.

I think the first patch in the attached set is worth committing, just
to tighten things up.

[1] https://www.postgresql.org/message-id/CAExHW5t6NpGaif6ZO_P+L1cPZk27+Ye2LxfqRVXf0OiSsW9WSg@mail.gmail.com
--
Best Wishes,
Ashutosh Bapat

Attachment
Hi Ashutosh,

On Tue, Feb 25, 2025 at 8:04 PM Ashutosh Bapat
<ashutosh.bapat.oss@gmail.com> wrote:
> On Thu, Feb 20, 2025 at 5:28 PM Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> wrote:
> > On Tue, Feb 4, 2025 at 4:07 PM Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> wrote:
> > > If we are not interested in saving memory, there is a simpler way to
> > > improve planning time by adding a hash table per equivalence class to
> > > store the derived clauses, instead of a linked list, when the number
> > > of derived clauses is higher than a threshold (say 32 same as the
> > > threshold for join_rel_list. Maybe that approach will yield stable
> > > planning time.
>
> I implemented the above idea in attached patches.  I also added the
> following query, inspired from Alvaro's query, to summarise the
> results.
> with master_avgs as
> (select code_tag, num_parts, num_joins, pwj, avg(planning_time_ms)
> avg_pt, stddev(planning_time_ms) stddev_pt
> from msmts where code_tag = 'master'
> group by code_tag, num_parts, num_joins, pwj),
> patched_avgs as
> (select code_tag, num_parts, num_joins, pwj, avg(planning_time_ms)
> avg_pt, stddev(planning_time_ms) stddev_pt
> from msmts where code_tag = 'patched'
> group by code_tag, num_parts, num_joins, pwj)
> select num_parts,
> num_joins,
> format('s=%s%% md=%s%% pd=%s%%',
> ((m.avg_pt - p.avg_pt)/m.avg_pt * 100)::numeric(6, 2),
> (m.stddev_pt/m.avg_pt * 100)::numeric(6, 2),
> (p.stddev_pt/p.avg_pt * 100)::numeric(6, 2))
> from master_avgs m join patched_avgs p using (num_parts, num_joins,
> pwj) where not pwj order by 1, 2, 3;
> \crosstabview 1 2 3
>
> not pwj in the last line should be changed to pwj to get results with
> enable_partitionwise_join = true.
>
> With the attached patches, I observe following results
>
> With PWJ disabled
>  num_parts |               2               |              3
>    |             4              |              5
>
-----------+-------------------------------+------------------------------+----------------------------+-----------------------------
>          0 | s=-4.44% md=17.91% pd=23.05%  | s=-0.83% md=11.10%
> pd=19.58% | s=0.87% md=4.04% pd=7.91%  | s=-35.24% md=7.63% pd=9.69%
>         10 | s=30.13% md=118.18% pd=37.44% | s=-3.49% md=0.58%
> pd=0.49%   | s=-0.83% md=0.29% pd=0.35% | s=-0.24% md=0.35% pd=0.32%
>        100 | s=1.94% md=13.19% pd=4.08%    | s=-0.27% md=0.18%
> pd=0.44%   | s=7.04% md=3.05% pd=3.11%  | s=12.75% md=1.69% pd=0.81%
>        500 | s=4.39% md=1.71% pd=1.33%     | s=10.17% md=1.28%
> pd=1.90%   | s=23.04% md=0.24% pd=0.58% | s=30.87% md=0.30% pd=1.11%
>       1000 | s=4.27% md=1.21% pd=1.97%     | s=13.97% md=0.44%
> pd=0.79%   | s=24.05% md=0.63% pd=1.02% | s=30.77% md=0.77% pd=0.17%
>
>
> Each cell is a triple (s, md, pd) where s is improvement in planning
> time using the patches in % as compared to the master (higher the
> better), md = standard deviation as % of the average planning time on
> master, pd = is standard deviation as % of the average planning time
> with patches.
>
> With PWJ enabled
>  num_parts |              2               |              3
>   |              4              |              5
>
-----------+------------------------------+------------------------------+-----------------------------+------------------------------
>          0 | s=-94.25% md=6.98% pd=56.03% | s=44.10% md=141.13%
> pd=9.32% | s=42.71% md=46.00% pd=6.55% | s=-26.12% md=6.72% pd=15.20%
>         10 | s=-25.89% md=4.29% pd=63.75% | s=-1.34% md=3.15% pd=3.26%
>   | s=0.31% md=4.13% pd=4.34%   | s=-1.34% md=3.10% pd=6.73%
>        100 | s=-2.83% md=0.94% pd=1.31%   | s=-2.17% md=4.57% pd=4.41%
>   | s=0.98% md=1.59% pd=1.81%   | s=1.87% md=1.10% pd=0.79%
>        500 | s=1.57% md=3.01% pd=1.70%    | s=6.99% md=1.58% pd=1.68%
>   | s=11.11% md=0.24% pd=0.62%  | s=11.65% md=0.18% pd=0.90%
>       1000 | s=3.59% md=0.98% pd=1.78%    | s=10.83% md=0.88% pd=0.46%
>   | s=15.62% md=0.46% pd=0.13%  | s=16.38% md=0.63% pd=0.29%
>
> Same numbers measured for previous set of patches [1], which improves
> both memory consumption as well as planning time.
>
> With PWJ disabled
>  num_parts |               2               |              3
>    |              4               |               5
>
-----------+-------------------------------+------------------------------+------------------------------+--------------------------------
>          0 | s=4.68% md=18.17% pd=22.09%   | s=-2.54% md=12.00%
> pd=13.81% | s=-2.02% md=3.84% pd=4.43%   | s=-69.14% md=11.06%
> pd=126.87%
>         10 | s=-24.85% md=20.42% pd=35.69% | s=-4.31% md=0.73%
> pd=1.53%   | s=-14.97% md=0.32% pd=31.90% | s=-0.57% md=0.79% pd=0.50%
>        100 | s=0.27% md=4.69% pd=1.55%     | s=4.16% md=0.29% pd=0.18%
>    | s=11.76% md=0.85% pd=0.49%   | s=15.76% md=1.64% pd=2.32%
>        500 | s=0.54% md=1.88% pd=1.81%     | s=9.36% md=1.17% pd=0.87%
>    | s=21.45% md=0.74% pd=0.88%   | s=30.47% md=0.17% pd=1.17%
>       1000 | s=3.22% md=1.36% pd=0.99%     | s=14.74% md=0.86%
> pd=0.44%   | s=24.50% md=0.36% pd=0.31%   | s=27.97% md=0.27% pd=0.25%
>
> With PWJ enabled
>  num_parts |              2              |             3
> |             4              |              5
>
-----------+-----------------------------+----------------------------+----------------------------+-----------------------------
>          0 | s=11.07% md=19.28% pd=8.70% | s=-1.18% md=5.88% pd=4.31%
> | s=-2.25% md=8.42% pd=3.77% | s=25.07% md=11.48% pd=3.87%
>         10 | s=-9.07% md=2.65% pd=14.58% | s=0.55% md=3.10% pd=3.41%
> | s=3.89% md=3.94% pd=3.79%  | s=7.25% md=2.87% pd=3.03%
>        100 | s=-4.53% md=0.49% pd=8.53%  | s=2.24% md=4.24% pd=3.96%
> | s=6.70% md=1.30% pd=2.08%  | s=9.09% md=1.39% pd=1.50%
>        500 | s=-1.65% md=1.59% pd=1.44%  | s=6.31% md=0.89% pd=1.11%
> | s=12.72% md=0.20% pd=0.29% | s=15.02% md=0.28% pd=0.83%
>       1000 | s=1.53% md=1.01% pd=1.66%   | s=11.80% md=0.66% pd=0.71%
> | s=16.23% md=0.58% pd=0.18% | s=17.16% md=0.67% pd=0.68%
>
> There are a few things to notice
> 1. There is not much difference in the planning time improvement by
> both the patchsets. But the patchset attached to [1] improves memory
> consumption as well. So it looks more attractive.
> 2. The performance regressions usually coincide with higher standard
> deviation. This indicates that both the performance gains or losses
> seen at lower numbers of partitions and joins are not real and
> possibly ignored. I have run the script multiple times but some or
> other combination of lower number of partitions and lower number of
> joins shows higher deviation and thus unstable results. I have not be
> able to find a way where all the combinations show a stable result.

Thanks for the patch and the extensive benchmarking.

Would you please also share a simple, self-contained example that I
can use to reproduce and verify the performance improvements? It’s
helpful to first see the patch in action with a representative query,
and then refer to the more extensive benchmark results you've already
shared as needed. I'm also having a hard time telling from the scripts
which query was used to produce the numbers in the report.

Btw, in the commit message, you mention:

===
When there are thousands of partitions in a partitioned table, there
can be thousands of derived clauses in the list making it inefficient
for a lookup.
===

I haven’t found a description of how that many clauses end up in the
ec->ec_derives list. IIUC, it's create_join_clause() where the child
clauses get added, and it would be helpful to mention that, since that
also appears to be the hotspot your patch is addressing.
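
For reference, the lookup I believe you're describing is, loosely
paraphrased, something like the helper below on master --
find_derived_clause() is just a made-up wrapper around the scans in
create_join_clause(), and the details are from memory, so possibly not
exact:

static RestrictInfo *
find_derived_clause(EquivalenceClass *ec, EquivalenceClass *parent_ec,
                    EquivalenceMember *leftem, EquivalenceMember *rightem)
{
    ListCell   *lc;

    /* check the original source clauses first ... */
    foreach(lc, ec->ec_sources)
    {
        RestrictInfo *rinfo = (RestrictInfo *) lfirst(lc);

        if (rinfo->left_em == leftem &&
            rinfo->right_em == rightem &&
            rinfo->parent_ec == parent_ec)
            return rinfo;
    }

    /* ... then every previously derived clause, one by one */
    foreach(lc, ec->ec_derives)
    {
        RestrictInfo *rinfo = (RestrictInfo *) lfirst(lc);

        if (rinfo->left_em == leftem &&
            rinfo->right_em == rightem &&
            rinfo->parent_ec == parent_ec)
            return rinfo;
    }

    return NULL;
}

With thousands of child clauses accumulated in ec_derives, the second
loop gets walked in full for most lookups, which would explain the
hotspot.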

> I think the first patch in the attached set is worth committing, just
> to tighten things up.

I agree and am happy to commit it if there are no objections.

--
Thanks, Amit Langote



On Fri, Mar 14, 2025 at 5:36 PM Amit Langote <amitlangote09@gmail.com> wrote:

>
> Thanks for the patch and the extensive benchmarking.
>
> Would you please also share a simple, self-contained example that I
> can use to reproduce and verify the performance improvements? It’s
> helpful to first see the patch in action with a representative query,
> and then refer to the more extensive benchmark results you've already
> shared as needed. I'm also having a hard time telling from the scripts
> which query was used to produce the numbers in the report.
>

Here are the steps:
1. Run setup.sql attached in [1]. It will add a few helper functions
and create the required tables (partitioned and non-partitioned ones).
2. The sheet named "rae data" attached to [1] has the queries whose
performance is measured. They use the tables created by setup.sql.

You may further use the same scripts to run benchmarks.

> Btw, in the commit message, you mention:
>
> ===
> When there are thousands of partitions in a partitioned table, there
> can be thousands of derived clauses in the list making it inefficient
> for a lookup.
> ===
>
> I haven’t found a description of how that many clauses end up in the
> ec->ec_derived list. IIUC, it's create_join_clause() where the child
> clauses get added, and it would be helpful to mention that, since that
> also appears to be the hotspot your patch is addressing.

You are right, create_join_clause() adds the derived clauses for
partitions. Please note that the optimization, being modelled after the
join rel list, is applicable to both partitioned and non-partitioned
cases, with or without partitionwise join.

>
> > I think the first patch in the attached set is worth committing, just
> > to tighten things up.
>
> I agree and am happy to commit it if there are no objections.

Thanks.

[1] https://www.postgresql.org/message-id/CAExHW5vnwgTgfsCiNM7E4TnkxD1b_ZHPafNe1f041u=o131PYg@mail.gmail.com


--
Best Wishes,
Ashutosh Bapat



On Mon, Mar 17, 2025 at 5:47 PM Ashutosh Bapat
<ashutosh.bapat.oss@gmail.com> wrote:
> On Fri, Mar 14, 2025 at 5:36 PM Amit Langote <amitlangote09@gmail.com> wrote:
> > Thanks for the patch and the extensive benchmarking.
> >
> > Would you please also share a simple, self-contained example that I
> > can use to reproduce and verify the performance improvements? It’s
> > helpful to first see the patch in action with a representative query,
> > and then refer to the more extensive benchmark results you've already
> > shared as needed. I'm also having a hard time telling from the scripts
> > which query was used to produce the numbers in the report.
> >
>
> Here are steps
> 1. Run setup.sql attached in [1]. It will add a few helper functions
> and create required tables (partitioned and non-partitioned ones).
> 2. The sheet named "rae data" attached to [1] has queries whose
> performance is measured. They use the tables created by setup.sql.
>
> You may further use the same scripts to run benchmarks.

Thanks for pointing me to that.

I ran a couple of benchmarks of my own as follows.

cat benchmark_amit.sh
for p in 0 16 32 64 128 256 512 1024; do
  echo -ne "$p\t";
  pgbench -i --partitions=$p > /dev/null 2>&1
  pgbench -n -T 30 -f /tmp/query.sql | grep latency | awk '{print $4}';
done

For a 3-way join:
cat /tmp/query.sql
select * from pgbench_accounts t1, pgbench_accounts t2,
pgbench_accounts t3 where t2.aid = t1.aid AND t3.aid = t2.aid;

nparts  master      patched     %change
0       35.508      36.066      1.571
16      66.79       67.704      1.368
32      67.774      68.179      0.598
64      51.023      50.471     -1.082
128     56.4        55.759     -1.137
256     70.134      68.401     -2.471
512     120.621     113.552    -5.861
1024    405.339     312.726    -22.848

For a 6-way join:
cat /tmp/query.sql
select * from pgbench_accounts t1, pgbench_accounts t2,
pgbench_accounts t3, pgbench_accounts t4, pgbench_accounts t5,
pgbench_accounts t6 where t2.aid = t1.aid AND t3.aid = t2.aid and
t4.aid = t3.aid and t5.aid = t4.aid and t6.aid = t5.aid;

nparts  master      patched     %change
0       66.144      64.932     -1.832
16      100.874     100.491    -0.380
32      104.645     104.536    -0.104
64      114.415     109.193    -4.564
128     145.422     130.458    -10.290
256     273.761     209.919    -23.320
512     1359.896    616.295    -54.681
1024    7183.765    2857.086   -60.229

-60% means 60% reduction in latency due to the patch.

As others have already found, performance improvements become
substantial as both partition count and join depth increase. The patch
seems to have minimal impact at low partition counts and low join
complexity -- base cases (e.g., 0–32 partitions, 3-way joins) are
essentially unchanged, which is good to see.

I haven’t measured memory increase from the patch, but let me know if
that has already been evaluated and shown not to be a showstopper.

I also noticed using perf that create_join_clause() is a hotspot when
running without the patch, especially at high partition counts (> 500)
and with more join relations.

Let me know if my methodology seems off or if the results look reasonable.

--
Thanks, Amit Langote



Hi Amit,


On Tue, Mar 18, 2025 at 4:02 PM Amit Langote <amitlangote09@gmail.com> wrote:
>
> I ran a couple of benchmarks of my own as follows.
>
> cat benchmark_amit.sh
> for p in 0 16 32 64 128 256 512 1024; do
>   echo -ne "$p\t";
>   pgbench -i --partitions=$p > /dev/null 2>&1
>   pgbench -n -T 30 -f /tmp/query.sql | grep latency | awk '{print $4}';
> done
>
> For a 3-way join:
> cat /tmp/query.sql
> select * from pgbench_accounts t1, pgbench_accounts t2,
> pgbench_accounts t3 where t2.aid = t1.aid AND t3.aid = t2.aid;
>
> nparts  master      patched     %change
> 0       35.508      36.066      1.571
> 16      66.79       67.704      1.368
> 32      67.774      68.179      0.598
> 64      51.023      50.471     -1.082
> 128     56.4        55.759     -1.137
> 256     70.134      68.401     -2.471
> 512     120.621     113.552    -5.861
> 1024    405.339     312.726    -22.848
>
> For a 6-way jon
> cat /tmp/query.sql
> select * from pgbench_accounts t1, pgbench_accounts t2,
> pgbench_accounts t3, pgbench_accounts t4, pgbench_accounts t5,
> pgbench_accounts t6 where t2.aid = t1.aid AND t3.aid = t2.aid and
> t4.aid = t3.aid and t5.aid = t4.aid and t6.aid = t5.aid;
>
> nparts  master      patched     %change
> 0       66.144      64.932     -1.832
> 16      100.874     100.491    -0.380
> 32      104.645     104.536    -0.104
> 64      114.415     109.193    -4.564
> 128     145.422     130.458    -10.290
> 256     273.761     209.919    -23.320
> 512     1359.896    616.295    -54.681
> 1024    7183.765    2857.086   -60.229
>
> -60% means 60% reduction in latency due to the patch.
>
> As others have already found, performance improvements become
> substantial as both partition count and join depth increase. The patch
> seems to have minimal impact at low partition counts and low join
> complexity -- base cases (e.g., 0–32 partitions, 3-way joins) are
> essentially unchanged, which is good to see.

I assume these are with enable_partitionwise_join = off, since it's not
enabled in the query.

With a lower number of partitions and joins, the execution time is
substantial compared to the planning time, so changes in execution
time affect the changes in latency. So I agree that the difference in
numbers for lower numbers of partitions and joins is noise. That agrees
with my results posted a few emails before.

At a higher number of partitions and joins, the planning time
dominates the latency, hence any variation in the planning time
dominates the variation in the latency. Thus the improvements seen here
are due to improvements in the planning time. Again, that agrees with
my results in the previous email.

>
> I haven’t measured memory increase from the patch, but let me know if
> that has already been evaluated and shown not to be a showstopper.

There's a slight increase in memory consumption because of the hash
table, but it's very minimal.

Here are the memory numbers in KB (presented in the same format as
before) with pwj disabled:
 num_parts |            2             |            3             |             4             |              5
-----------+--------------------------+--------------------------+---------------------------+-----------------------------
         0 | s=0 md=15 pd=15          | s=0 md=21 pd=21          | s=0 md=27 pd=27           | s=0 md=33 pd=33
        10 | s=0 md=218 pd=218        | s=-20 md=455 pd=475      | s=-20 md=868 pd=888       | s=-20 md=1697 pd=1717
       100 | s=-20 md=1824 pd=1844    | s=-80 md=3718 pd=3798    | s=-160 md=6400 pd=6560    | s=-320 md=10233 pd=10553
       500 | s=-160 md=9395 pd=9555   | s=-320 md=20216 pd=20536 | s=-640 md=35735 pd=36375  | s=-1280 md=60808 pd=62088
      1000 | s=-320 md=19862 pd=20182 | s=-640 md=45739 pd=46379 | s=-1280 md=84210 pd=85490 | s=-2561 md=149740 pd=152301

Each cell shows s = difference in memory consumed, md = memory consumed
without the patch, and pd = memory consumed with the patch. A negative
difference indicates an increase in memory consumption.

with pwj enabled
 num_parts |            2             |             3              |              4              |               5
-----------+--------------------------+----------------------------+-----------------------------+-------------------------------
         0 | s=0 md=15 pd=15          | s=0 md=21 pd=21            | s=0 md=27 pd=27             | s=0 md=33 pd=33
        10 | s=0 md=365 pd=365        | s=-20 md=1198 pd=1218      | s=-20 md=3571 pd=3591       | s=-20 md=10426 pd=10446
       100 | s=-21 md=3337 pd=3358    | s=-80 md=11237 pd=11317    | s=-160 md=33845 pd=34005    | s=-320 md=99502 pd=99822
       500 | s=-160 md=17206 pd=17366 | s=-320 md=60096 pd=60416   | s=-640 md=183306 pd=183946  | s=-1280 md=556705 pd=557985
      1000 | s=-320 md=36119 pd=36439 | s=-640 md=131457 pd=132097 | s=-1280 md=404809 pd=406089 | s=-2561 md=1263664 pd=1266225

Percentage-wise this is 3-4% at maximum.
>
> I also noticed using perf that create_join_clause() is a hotspot when
> running without the patch, especially at high partition counts (> 500)
> and the more join relations.

That's an interesting observation. Any hotspot in planning would show
up as a hotspot in the total execution time, so that's expected.

>
> Let me know if my methodology seems off or if the results look reasonable.

Thanks for benchmarking using a different method. The results agree
with my results.
--
Best Wishes,
Ashutosh Bapat



On Tue, Mar 18, 2025 at 8:48 PM Ashutosh Bapat
<ashutosh.bapat.oss@gmail.com> wrote:
> On Tue, Mar 18, 2025 at 4:02 PM Amit Langote <amitlangote09@gmail.com> wrote:
> >
> > I ran a couple of benchmarks of my own as follows.
> >
> > cat benchmark_amit.sh
> > for p in 0 16 32 64 128 256 512 1024; do
> >   echo -ne "$p\t";
> >   pgbench -i --partitions=$p > /dev/null 2>&1
> >   pgbench -n -T 30 -f /tmp/query.sql | grep latency | awk '{print $4}';
> > done
> >
> > For a 3-way join:
> > cat /tmp/query.sql
> > select * from pgbench_accounts t1, pgbench_accounts t2,
> > pgbench_accounts t3 where t2.aid = t1.aid AND t3.aid = t2.aid;
> >
> > nparts  master      patched     %change
> > 0       35.508      36.066      1.571
> > 16      66.79       67.704      1.368
> > 32      67.774      68.179      0.598
> > 64      51.023      50.471     -1.082
> > 128     56.4        55.759     -1.137
> > 256     70.134      68.401     -2.471
> > 512     120.621     113.552    -5.861
> > 1024    405.339     312.726    -22.848
> >
> > For a 6-way jon
> > cat /tmp/query.sql
> > select * from pgbench_accounts t1, pgbench_accounts t2,
> > pgbench_accounts t3, pgbench_accounts t4, pgbench_accounts t5,
> > pgbench_accounts t6 where t2.aid = t1.aid AND t3.aid = t2.aid and
> > t4.aid = t3.aid and t5.aid = t4.aid and t6.aid = t5.aid;
> >
> > nparts  master      patched     %change
> > 0       66.144      64.932     -1.832
> > 16      100.874     100.491    -0.380
> > 32      104.645     104.536    -0.104
> > 64      114.415     109.193    -4.564
> > 128     145.422     130.458    -10.290
> > 256     273.761     209.919    -23.320
> > 512     1359.896    616.295    -54.681
> > 1024    7183.765    2857.086   -60.229
> >
> > -60% means 60% reduction in latency due to the patch.
> >
> > As others have already found, performance improvements become
> > substantial as both partition count and join depth increase. The patch
> > seems to have minimal impact at low partition counts and low join
> > complexity -- base cases (e.g., 0–32 partitions, 3-way joins) are
> > essentially unchanged, which is good to see.
>
> I assume these are with enable_partitionwise_join = off since it's not
> enabled in the query.

Yes, those were with pwj=off.  FTR, here are the numbers I get with pwj=on.

3-way:

nparts  master      patched     %change
0       38.407      34.675      -9.717
16      69.357      64.312      -7.274
32      70.027      67.079      -4.210
64      73.807      70.725      -4.176
128     83.875      81.945      -2.301
256     102.858     106.19       3.239
512     181.95      180.891     -0.582
1024    464.503     440.355     -5.199

6-way:

nparts  master      patched     %change
0       64.411      67.175       4.291
16      203.344     209.75       3.150
32      296.952     300.966      1.352
64      445.536     449.917      0.983
128     805.103     781.892     -2.883
256     1695.746    1574.346    -7.159
512     4743.16     4010.196    -15.453
1024    16772.454   12284.706   -26.757

So a bit less impressive than the improvements for pwj=off.  Also, the
patch seems to make things worse for low partition counts (0-16) for
6-way joins, which I am not quite sure is within the noise range.
Have you noticed that too and, if yes, do you know what might be
causing it?

> > I haven’t measured memory increase from the patch, but let me know if
> > that has already been evaluated and shown not to be a showstopper.
>
> There's a slight increase in memory consumption because of the hash
> table but it's very minimal.
>
> Here are memory numbers in kb (presented in the same format as before)
> with pwj disabled
>  num_parts |            2             |            3             |
>         4             |              5
>
-----------+--------------------------+--------------------------+---------------------------+-----------------------------
>          0 | s=0 md=15 pd=15          | s=0 md=21 pd=21          | s=0
> md=27 pd=27           | s=0 md=33 pd=33
>         10 | s=0 md=218 pd=218        | s=-20 md=455 pd=475      |
> s=-20 md=868 pd=888       | s=-20 md=1697 pd=1717
>        100 | s=-20 md=1824 pd=1844    | s=-80 md=3718 pd=3798    |
> s=-160 md=6400 pd=6560    | s=-320 md=10233 pd=10553
>        500 | s=-160 md=9395 pd=9555   | s=-320 md=20216 pd=20536 |
> s=-640 md=35735 pd=36375  | s=-1280 md=60808 pd=62088
>       1000 | s=-320 md=19862 pd=20182 | s=-640 md=45739 pd=46379 |
> s=-1280 md=84210 pd=85490 | s=-2561 md=149740 pd=152301
>
> each column is s=difference in memory consumed, md = memory consumed
> without patch, pd = memory consumed with patch. -ve difference shows
> increase in memory consumption.
>
> with pwj enabled
>  num_parts |            2             |             3              |
>            4              |               5
>
-----------+--------------------------+----------------------------+-----------------------------+-------------------------------
>          0 | s=0 md=15 pd=15          | s=0 md=21 pd=21            |
> s=0 md=27 pd=27             | s=0 md=33 pd=33
>         10 | s=0 md=365 pd=365        | s=-20 md=1198 pd=1218      |
> s=-20 md=3571 pd=3591       | s=-20 md=10426 pd=10446
>        100 | s=-21 md=3337 pd=3358    | s=-80 md=11237 pd=11317    |
> s=-160 md=33845 pd=34005    | s=-320 md=99502 pd=99822
>        500 | s=-160 md=17206 pd=17366 | s=-320 md=60096 pd=60416   |
> s=-640 md=183306 pd=183946  | s=-1280 md=556705 pd=557985
>       1000 | s=-320 md=36119 pd=36439 | s=-640 md=131457 pd=132097 |
> s=-1280 md=404809 pd=406089 | s=-2561 md=1263664 pd=1266225
>
> %wise this is 3-4% maximum.

Ok, thanks for those.  Looks within an acceptable range to me.

--
Thanks, Amit Langote



On Wed, Mar 19, 2025 at 8:22 AM Amit Langote <amitlangote09@gmail.com> wrote:
>
> On Tue, Mar 18, 2025 at 8:48 PM Ashutosh Bapat
> <ashutosh.bapat.oss@gmail.com> wrote:
> > On Tue, Mar 18, 2025 at 4:02 PM Amit Langote <amitlangote09@gmail.com> wrote:
> > >
> > > I ran a couple of benchmarks of my own as follows.
> > >
> > > cat benchmark_amit.sh
> > > for p in 0 16 32 64 128 256 512 1024; do
> > >   echo -ne "$p\t";
> > >   pgbench -i --partitions=$p > /dev/null 2>&1
> > >   pgbench -n -T 30 -f /tmp/query.sql | grep latency | awk '{print $4}';
> > > done
> > >
> > > For a 3-way join:
> > > cat /tmp/query.sql
> > > select * from pgbench_accounts t1, pgbench_accounts t2,
> > > pgbench_accounts t3 where t2.aid = t1.aid AND t3.aid = t2.aid;
> > >
> > > nparts  master      patched     %change
> > > 0       35.508      36.066      1.571
> > > 16      66.79       67.704      1.368
> > > 32      67.774      68.179      0.598
> > > 64      51.023      50.471     -1.082
> > > 128     56.4        55.759     -1.137
> > > 256     70.134      68.401     -2.471
> > > 512     120.621     113.552    -5.861
> > > 1024    405.339     312.726    -22.848
> > >
> > > For a 6-way jon
> > > cat /tmp/query.sql
> > > select * from pgbench_accounts t1, pgbench_accounts t2,
> > > pgbench_accounts t3, pgbench_accounts t4, pgbench_accounts t5,
> > > pgbench_accounts t6 where t2.aid = t1.aid AND t3.aid = t2.aid and
> > > t4.aid = t3.aid and t5.aid = t4.aid and t6.aid = t5.aid;
> > >
> > > nparts  master      patched     %change
> > > 0       66.144      64.932     -1.832
> > > 16      100.874     100.491    -0.380
> > > 32      104.645     104.536    -0.104
> > > 64      114.415     109.193    -4.564
> > > 128     145.422     130.458    -10.290
> > > 256     273.761     209.919    -23.320
> > > 512     1359.896    616.295    -54.681
> > > 1024    7183.765    2857.086   -60.229
> > >
> > > -60% means 60% reduction in latency due to the patch.
> > >
> > > As others have already found, performance improvements become
> > > substantial as both partition count and join depth increase. The patch
> > > seems to have minimal impact at low partition counts and low join
> > > complexity -- base cases (e.g., 0–32 partitions, 3-way joins) are
> > > essentially unchanged, which is good to see.
> >
> > I assume these are with enable_partitionwise_join = off since it's not
> > enabled in the query.
>
> Yes, those were with pwj=off.  FTR, numbers I get with pwj=on.
>
> 3-way:
>
> nparts  master      patched     %change
> 0       38.407      34.675      -9.717
> 16      69.357      64.312      -7.274
> 32      70.027      67.079      -4.210
> 64      73.807      70.725      -4.176
> 128     83.875      81.945      -2.301
> 256     102.858     106.19       3.239
> 512     181.95      180.891     -0.582
> 1024    464.503     440.355     -5.199
>
> 6-way:
>
> nparts  master      patched     %change
> 0       64.411      67.175       4.291
> 16      203.344     209.75       3.150
> 32      296.952     300.966      1.352
> 64      445.536     449.917      0.983
> 128     805.103     781.892     -2.883
> 256     1695.746    1574.346    -7.159
> 512     4743.16     4010.196    -15.453
> 1024    16772.454   12284.706   -26.757
>
> So a bit less impressive than the improvements for pwj=off.  Also,
> patch seems to make things worse for low partition counts (0-16) for
> 6-way joins, which I am not quite sure is within the noise range.
> Have you noticed that too and, if yes, do you know what might be
> causing it?

I have observed similar things with lower numbers of partitions. In my
observation, such results have coincided with relatively large
variance. With 0 partitions there is no difference in the code
behaviour irrespective of pwj being on or off, hence we don't expect
any variation in the numbers you posted yesterday and today when
nparts = 0. But there it is. I have always observed that much variation
in planning time for one combination or the other.

With a 6-way join there will be 6 * 5 - 5 derived clauses in one
equivalence class. That's close to 32, which is the threshold to start
using a hash table, so some slight perturbation is expected around
that threshold. But given that it's lower than 32, that shouldn't
apply here.

--
Best Wishes,
Ashutosh Bapat



On Wed, Mar 19, 2025 at 3:31 PM Ashutosh Bapat
<ashutosh.bapat.oss@gmail.com> wrote:
> On Wed, Mar 19, 2025 at 8:22 AM Amit Langote <amitlangote09@gmail.com> wrote:
> > Yes, those were with pwj=off.  FTR, numbers I get with pwj=on.
> >
> > 3-way:
> >
> > nparts  master      patched     %change
> > 0       38.407      34.675      -9.717
> > 16      69.357      64.312      -7.274
> > 32      70.027      67.079      -4.210
> > 64      73.807      70.725      -4.176
> > 128     83.875      81.945      -2.301
> > 256     102.858     106.19       3.239
> > 512     181.95      180.891     -0.582
> > 1024    464.503     440.355     -5.199
> >
> > 6-way:
> >
> > nparts  master      patched     %change
> > 0       64.411      67.175       4.291
> > 16      203.344     209.75       3.150
> > 32      296.952     300.966      1.352
> > 64      445.536     449.917      0.983
> > 128     805.103     781.892     -2.883
> > 256     1695.746    1574.346    -7.159
> > 512     4743.16     4010.196    -15.453
> > 1024    16772.454   12284.706   -26.757
> >
> > So a bit less impressive than the improvements for pwj=off.  Also,
> > patch seems to make things worse for low partition counts (0-16) for
> > 6-way joins, which I am not quite sure is within the noise range.
> > Have you noticed that too and, if yes, do you know what might be
> > causing it?
>
> I have observed similar things with lower numbers of partitions. In my
> observation, such results have coincided with relatively large
> variance. With 0 partitions there is no difference in the code
> behaviour irrespective of pwj ON or Off. Hence we don't expect any
> variation in the numbers you posted yesterday and today when nparts =
> 0. But there it is. I have always observed that much variation in
> planning time for one combination or the other.
>
> With 6 -way join there will be 6 * 5 - 5 derived clauses in one
> equivalence class, That's close to 32, which is the threshold to start
> using a hash table. So some slight perturbation is expected around
> that threshold. But given that it's lower than 32, that shouldn't
> apply here.

Ok, thanks for that analysis.  I don't think there's anything about
the patch that makes it particularly less suitable for pwj=on.

I read patch 0002 in detail last week and wrote a follow-up patch
(0003), mostly for cosmetic improvements, which I plan to squash into
0002.

I’ve also revised the commit messages for 0001 and 0002. Let me know
if that looks reasonable or if I’ve missed any credits.

--
Thanks, Amit Langote

Attachment
On Mon, Mar 24, 2025 at 7:23 PM Amit Langote <amitlangote09@gmail.com> wrote:
>
> Ok, thanks for that analysis.  I don't think there's anything about
> the patch that makes it particularly less suitable for pwj=on.
>

I agree.

> I read patch 0002 in detail last week and wrote a follow-up patch
> (0003), mostly for cosmetic improvements, which I plan to squash into
> 0002.

For some reason v2-0003 didn't apply cleanly on my local branch.
Possibly v2-0002 is a modified version of my 0002. In order to not
disturb your patchset, I have attached my edits as a diff. If you find
those useful, apply them to 0003 and then squash it into 0002.

Here are some more cosmetic comments on 0003. A bit of discussion
might be needed, hence I did not edit these myself.

-/* Hash table entry in ec_derives_hash. */
+/* Hash table entry used in ec_derives_hash. */

I think we don't need "used" here; "entry in hash table" is good enough. But I am
OK even if it stays that way.

- add_derived_clauses(ec1, ec2->ec_derives_list);
+ /* Updates ec1's ec_derives_list and ec_derives_hash if present. */
+ ec_add_derived_clauses(ec1, ec2->ec_derives_list);

Suggestion in the attached diff.

ec1->ec_relids = bms_join(ec1->ec_relids, ec2->ec_relids);
ec1->ec_has_const |= ec2->ec_has_const;
/* can't need to set has_volatile */
@@ -396,7 +400,7 @@ process_equivalence(PlannerInfo *root,
/* just to avoid debugging confusion w/ dangling pointers: */
ec2->ec_members = NIL;
ec2->ec_sources = NIL;
- clear_ec_derived_clauses(ec2);
+ ec_clear_derived_clauses(ec2);

I pondered this naming convention when naming the functions, but it
seems it's not used consistently everywhere in this file, or I am not
able to see the underlying naming rule, if any. So I used mixed naming.
Let's keep your names though; I think they are better.

@@ -1403,6 +1403,17 @@ typedef struct JoinDomain
* entry: consider SELECT random() AS a, random() AS b ... ORDER BY b,a.
* So we record the SortGroupRef of the originating sort clause.
*
+ * Derived equality clauses between EC members are stored in ec_derives_list.

"clauses between EC members" doesn't read correctly - "clauses
equating EC members" seems better. We actually don't need to mention
"between EC members" at all - it's not relevant to the "size of the
list" which is the subject of this paragraph. What do you think?

+ * For small queries, this list is scanned directly during lookup. For larger
+ * queries -- e.g., with many partitions or joins -- a hash table
+ * (ec_derives_hash) is built for faster lookup. Both structures contain the
+ * same RestrictInfos and are maintained in parallel.

If only the list exists, there is nothing to maintain in parallel. The
sentence seems to indicate that both the hash table and the list are
always present (and maintained in parallel). How about dropping "Both
... parallel" and modifying the previous sentence to "... faster
lookup when the list grows beyond a threshold" or something similar?
The next sentence anyway mentions that both the structures are
maintained.

+ We retain the list even
+ * when the hash is used to simplify serialization (e.g., in
+ * _outEquivalenceClass()) and support EquivalenceClass merging.

Thanks for the review.

--
Best Wishes,
Ashutosh Bapat

Attachment
On Tue, Mar 25, 2025 at 3:17 PM Ashutosh Bapat
<ashutosh.bapat.oss@gmail.com> wrote:
> On Mon, Mar 24, 2025 at 7:23 PM Amit Langote <amitlangote09@gmail.com> wrote:
> >
> > Ok, thanks for that analysis.  I don't think there's anything about
> > the patch that makes it particularly less suitable for pwj=on.
> >
>
> I agree.
>
> > I read patch 0002 in detail last week and wrote a follow-up patch
> > (0003), mostly for cosmetic improvements, which I plan to squash into
> > 0002.
>
> For some reason v2-0003 didn't apply cleanly on my local branch.
> Possibly v2-0002 is a modified version of my 0002.

Ah, right -- I probably tweaked your 0002 a bit before splitting out
my changes into a separate patch.

> In order to not
> disturb your patchset, I have attached my edits as a diff. If you find
> those useful, apply them to 0003 and then squash it into 0002.
>
> Here are some more cosmetic comments on 0003. A bit of discussion
> might be needed. Hence did not edit myself.
>
> -/* Hash table entry in ec_derives_hash. */
> +/* Hash table entry used in ec_derives_hash. */
>
> I think we don't need "used" here entry in hash table is good enough. But I am
> ok even if it stays that way.

Ok, let's drop "used".

> - add_derived_clauses(ec1, ec2->ec_derives_list);
> + /* Updates ec1's ec_derives_list and ec_derives_hash if present. */
> + ec_add_derived_clauses(ec1, ec2->ec_derives_list);
>
> Suggestion in the attached diff.

Makes sense, though I added it after a small wording tweak to avoid
implying that the hash tables are merged in some special way. So now
it reads:

-        /* Updates ec1's ec_derives_list and ec_derives_hash if present. */
+        /*
+         * Appends ec2's derived clauses to ec1->ec_derives_list and adds
+         * them to ec1->ec_derives_hash if present.
+         */

> @@ -396,7 +400,7 @@ process_equivalence(PlannerInfo *root,
> /* just to avoid debugging confusion w/ dangling pointers: */
> ec2->ec_members = NIL;
> ec2->ec_sources = NIL;
> - clear_ec_derived_clauses(ec2);
> + ec_clear_derived_clauses(ec2);
>
> I pondered about this naming convention when naming the functions. But
> it seems it's not used everywhere in this file OR I am not able to see
> the underlying naming rule if any. So I used a mixed naming. Let's
> keep your names though. I think they are better.

Got it -- I went with the ec_ prefix mainly to make the new additions
self-consistent, since the file doesn’t seem to follow a strict naming
pattern. Glad the names work for you. While at it, I also applied the
same naming convention to two new functions I hadn’t touched earlier
for some reason.

> @@ -1403,6 +1403,17 @@ typedef struct JoinDomain
> * entry: consider SELECT random() AS a, random() AS b ... ORDER BY b,a.
> * So we record the SortGroupRef of the originating sort clause.
> *
> + * Derived equality clauses between EC members are stored in ec_derives_list.
>
> "clauses between EC members" doesn't read correctly - "clauses
> equating EC members" seems better. We actually don't need to mention
> "between EC members" at all - it's not relevant to the "size of the
> list" which is the subject of this paragraph. What do you think?

OK, I agree -- we don't need to mention EC members here. I've updated
the comment to keep the focus on the list itself.

> + * For small queries, this list is scanned directly during lookup. For larger
> + * queries -- e.g., with many partitions or joins -- a hash table
> + * (ec_derives_hash) is built for faster lookup. Both structures contain the
> + * same RestrictInfos and are maintained in parallel.
>
> If only the list exists, there is nothing to maintain in parallel. The
> sentence seems to indicate that both the hash table and the list are
> always present (and maintained in parallel). How about dropping "Both
> ... parallel" and modifying the previous sentence as " ... faster
> lookup when the list grows beyond a threshold" or something similar.
> The next sentence anyway mentions that both the structures are
> maintained.

Agree here too.  Here's the revised comment addressing these two points:

+ * Derived equality clauses are stored in ec_derives_list. For small queries,
+ * this list is scanned directly during lookup. For larger queries -- e.g.,
+ * with many partitions or joins -- a hash table (ec_derives_hash) is built
+ * when the list grows beyond a threshold, for faster lookup. When present,
+ * the hash table contains the same RestrictInfos and is maintained alongside
+ * the list. We retain the list even when the hash is used to simplify
+ * serialization (e.g., in _outEquivalenceClass()) and support
+ * EquivalenceClass merging.

I've merged my delta + your suggested changes as discussed above into 0002.

Btw, about ec_clear_derived_clauses():

@@ -749,7 +749,7 @@ remove_rel_from_eclass(EquivalenceClass *ec,
SpecialJoinInfo *sjinfo,
      * drop them.  (At this point, any such clauses would be base restriction
      * clauses, which we'd not need anymore anyway.)
      */
-    ec->ec_derives = NIL;
+    ec_clear_derived_clauses(ec);
 }

 /*
@@ -1544,8 +1544,7 @@ update_eclasses(EquivalenceClass *ec, int from, int to)
     list_free(ec->ec_members);
     ec->ec_members = new_members;

-    list_free(ec->ec_derives);
-    ec->ec_derives = NULL;
+    ec_clear_derived_clauses(ec);

We're losing that list_free() in the second hunk, aren't we?

There's also this comment:

+ * XXX: When thousands of partitions are involved, the list can become
+ * sizable. It might be worth freeing it explicitly in such cases.

So maybe ec_clear_derived_clauses() should take a free_list parameter,
to preserve the original behavior? What do you think?

--
Thanks, Amit Langote

Attachment
On Tue, Mar 25, 2025 at 12:58 PM Amit Langote <amitlangote09@gmail.com> wrote:
>
> -        /* Updates ec1's ec_derives_list and ec_derives_hash if present. */
> +        /*
> +         * Appends ec2's derived clauses to ec1->ec_derives_list and adds
> +         * them to ec1->ec_derives_hash if present.
> +         */

WFM.

>
> > @@ -396,7 +400,7 @@ process_equivalence(PlannerInfo *root,
> > /* just to avoid debugging confusion w/ dangling pointers: */
> > ec2->ec_members = NIL;
> > ec2->ec_sources = NIL;
> > - clear_ec_derived_clauses(ec2);
> > + ec_clear_derived_clauses(ec2);
> >
> > I pondered about this naming convention when naming the functions. But
> > it seems it's not used everywhere in this file OR I am not able to see
> > the underlying naming rule if any. So I used a mixed naming. Let's
> > keep your names though. I think they are better.
>
> Got it -- I went with the ec_ prefix mainly to make the new additions
> self-consistent, since the file doesn’t seem to follow a strict naming
> pattern. Glad the names work for you. While at it, I also applied the
> same naming convention to two new functions I hadn’t touched earlier
> for some reason.

WFM.


>
> + * Derived equality clauses are stored in ec_derives_list. For small queries,
> + * this list is scanned directly during lookup. For larger queries -- e.g.,
> + * with many partitions or joins -- a hash table (ec_derives_hash) is built
> + * when the list grows beyond a threshold, for faster lookup. When present,
> + * the hash table contains the same RestrictInfos and is maintained alongside
> + * the list. We retain the list even when the hash is used to simplify
> + * serialization (e.g., in _outEquivalenceClass()) and support
> + * EquivalenceClass merging.
>
> I've merged my delta + your suggested changes as discussed above into 0002.
>

LGTM.

> Btw, about ec_clear_derived_clauses():
>
> @@ -749,7 +749,7 @@ remove_rel_from_eclass(EquivalenceClass *ec,
> SpecialJoinInfo *sjinfo,
>       * drop them.  (At this point, any such clauses would be base restriction
>       * clauses, which we'd not need anymore anyway.)
>       */
> -    ec->ec_derives = NIL;
> +    ec_clear_derived_clauses(ec);
>  }
>
>  /*
> @@ -1544,8 +1544,7 @@ update_eclasses(EquivalenceClass *ec, int from, int to)
>      list_free(ec->ec_members);
>      ec->ec_members = new_members;
>
> -    list_free(ec->ec_derives);
> -    ec->ec_derives = NULL;
> +    ec_clear_derived_clauses(ec);
>
> We're losing that list_free() in the second hunk, aren't we?
>
> There's also this comment:
>
> + * XXX: When thousands of partitions are involved, the list can become
> + * sizable. It might be worth freeing it explicitly in such cases.
>
> So maybe ec_clear_derived_clauses() should take a free_list parameter,
> to preserve the original behavior? What do you think?

Well spotted. How about just calling list_free() in
ec_clear_derived_clauses() to simplify things? I mean, list_free()
might spend some cycles under remove_rel_from_eclass() and
process_equivalence() freeing the list, but that should be OK. Just
setting it to NIL by itself looks fine. If we bundle it in a function
with a flag, we will need to explain why/when to free the list and when
not to. That's unnecessary complexity, I feel. In other places where
the structures have the potential to grow in size, we have resorted to
freeing them rather than just forgetting them; for example, we free
appinfos and child_relids in try_partitionwise_join().

The list shouldn't be referenced anywhere else, so it should be safe
to free it. Note that I thought list_concat() used by
process_equivalence() would reuse the memory allocated to
ec2->ec_derives_list, but it doesn't. I verified that by setting the
threshold to 0, thus forcing the hash table to always be used, and
running the regression suite. It runs without any segfaults, and I
don't see any change in the time required to run the regression tests.

PFA patchset:
0001 and 0002 are the same as your patchset except for some of my edits
to the commit messages. Please feel free to accept or reject the edits.
0003 adds list_free() to ec_clear_derived_clauses().
--
Best Wishes,
Ashutosh Bapat

Attachment
On Tue, Mar 25, 2025 at 6:36 PM Ashutosh Bapat
<ashutosh.bapat.oss@gmail.com> wrote:
> On Tue, Mar 25, 2025 at 12:58 PM Amit Langote <amitlangote09@gmail.com> wrote:
> > Btw, about ec_clear_derived_clauses():
> >
> > @@ -749,7 +749,7 @@ remove_rel_from_eclass(EquivalenceClass *ec,
> > SpecialJoinInfo *sjinfo,
> >       * drop them.  (At this point, any such clauses would be base restriction
> >       * clauses, which we'd not need anymore anyway.)
> >       */
> > -    ec->ec_derives = NIL;
> > +    ec_clear_derived_clauses(ec);
> >  }
> >
> >  /*
> > @@ -1544,8 +1544,7 @@ update_eclasses(EquivalenceClass *ec, int from, int to)
> >      list_free(ec->ec_members);
> >      ec->ec_members = new_members;
> >
> > -    list_free(ec->ec_derives);
> > -    ec->ec_derives = NULL;
> > +    ec_clear_derived_clauses(ec);
> >
> > We're losing that list_free() in the second hunk, aren't we?
> >
> > There's also this comment:
> >
> > + * XXX: When thousands of partitions are involved, the list can become
> > + * sizable. It might be worth freeing it explicitly in such cases.
> >
> > So maybe ec_clear_derived_clauses() should take a free_list parameter,
> > to preserve the original behavior? What do you think?
>
> Well spotted. How about just calling list_free() in
> ec_clear_derived_clauses() to simplify things. I mean list_free()
> might spend some cycles under remove_rel_from_eclass() and
> process_equivalence() freeing the array but that should be ok. Just
> setting it to NIL by itself looks fine. If we bundle it in a function
> with a flag, we will need to explain why/when to free list and when to
> not. That's unnecessary complexity I feel. In other places where the
> structures have potential to grow in size, we have resorted to freeing
> them rather than just forgetting them. For example, we free appinfos
> in try_partitionwise_join() or child_relids.
>
> The list shouldn't be referenced anywhere else, so it should be safe
> to free it. Note that I thought list_concat() used by
> process_equivalence() would reuse the memory allocated to
> ec2->ec_derives_list but it doesn't. I verified that by setting the
> threshold to 0, thus forcing the hash table always and running a
> regression suite. It runs without any segfaults. I don't see any
> change in time required to run regression.
>
> PFA patchset
> 0001, 0002 are same as your patchset except some of my edits to the
> commit message. Please feel free to accept or reject the edits.

Thanks, I've noted your suggestions.

> 0003 adds list_free() to ec_clear_derived_clauses()

Thanks, I've merged it into 0002, with this blurb in its commit
message to describe it:

    The new ec_clear_derived_clauses() always frees ec_derives_list, even
    though some of the original code paths that cleared the old ec_derives
    field did not. This ensures consistent cleanup and avoids leaking
    memory when the list grows large.

I needed to do this though ;)

-    ec->ec_derives_list = NIL;
     list_free(ec->ec_derives_list);
+    ec->ec_derives_list = NIL;
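
For anyone following along, here's a minimal sketch of what the helper
boils down to (an illustration assuming it simply wraps list_free() and
a reset of the field, not necessarily the committed code); the ordering
matters because once the field is set to NIL the list cells can no
longer be reached, let alone freed:

/*
 * Sketch: release the derived-clause list and reset the field.
 * list_free() must run before the field is set to NIL, otherwise
 * the list cells would be leaked.
 */
void
ec_clear_derived_clauses(EquivalenceClass *ec)
{
    list_free(ec->ec_derives_list);
    ec->ec_derives_list = NIL;
}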

--
Thanks, Amit Langote

Attachment
On Tue, Mar 25, 2025 at 5:20 PM Amit Langote <amitlangote09@gmail.com> wrote:
>
> On Tue, Mar 25, 2025 at 6:36 PM Ashutosh Bapat
> <ashutosh.bapat.oss@gmail.com> wrote:
> > On Tue, Mar 25, 2025 at 12:58 PM Amit Langote <amitlangote09@gmail.com> wrote:
> > > Btw, about ec_clear_derived_clauses():
> > >
> > > @@ -749,7 +749,7 @@ remove_rel_from_eclass(EquivalenceClass *ec,
> > > SpecialJoinInfo *sjinfo,
> > >       * drop them.  (At this point, any such clauses would be base restriction
> > >       * clauses, which we'd not need anymore anyway.)
> > >       */
> > > -    ec->ec_derives = NIL;
> > > +    ec_clear_derived_clauses(ec);
> > >  }
> > >
> > >  /*
> > > @@ -1544,8 +1544,7 @@ update_eclasses(EquivalenceClass *ec, int from, int to)
> > >      list_free(ec->ec_members);
> > >      ec->ec_members = new_members;
> > >
> > > -    list_free(ec->ec_derives);
> > > -    ec->ec_derives = NULL;
> > > +    ec_clear_derived_clauses(ec);
> > >
> > > We're losing that list_free() in the second hunk, aren't we?
> > >
> > > There's also this comment:
> > >
> > > + * XXX: When thousands of partitions are involved, the list can become
> > > + * sizable. It might be worth freeing it explicitly in such cases.
> > >
> > > So maybe ec_clear_derived_clauses() should take a free_list parameter,
> > > to preserve the original behavior? What do you think?
> >
> > Well spotted. How about just calling list_free() in
> > ec_clear_derived_clauses() to simplify things. I mean list_free()
> > might spend some cycles under remove_rel_from_eclass() and
> > process_equivalence() freeing the array but that should be ok. Just
> > setting it to NIL by itself looks fine. If we bundle it in a function
> > with a flag, we will need to explain why/when to free list and when to
> > not. That's unnecessary complexity I feel. In other places where the
> > structures have potential to grow in size, we have resorted to freeing
> > them rather than just forgetting them. For example, we free appinfos
> > in try_partitionwise_join() or child_relids.
> >
> > The list shouldn't be referenced anywhere else, so it should be safe
> > to free it. Note that I thought list_concat() used by
> > process_equivalence() would reuse the memory allocated to
> > ec2->ec_derives_list but it doesn't. I verified that by setting the
> > threshold to 0, thus forcing the hash table always and running a
> > regression suite. It runs without any segfaults. I don't see any
> > change in time required to run regression.
> >
> > PFA patchset
> > 0001, 0002 are same as your patchset except some of my edits to the
> > commit message. Please feel free to accept or reject the edits.
>
> Thanks, I've noted your suggestions.
>
> > 0003 adds list_free() to ec_clear_derived_clauses()
>
> Thanks, I've merged it into 0002, with this blurb in its commit
> message to describe it:
>
>     The new ec_clear_derived_clauses() always frees ec_derives_list, even
>     though some of the original code paths that cleared the old ec_derives
>     field did not. This ensures consistent cleanup and avoids leaking
>     memory when the list grows large.
>
> I needed to do this though ;)
>
> -    ec->ec_derives_list = NIL;
>      list_free(ec->ec_derives_list);
> +    ec->ec_derives_list = NIL;

Silly me. I reran the regression tests with the threshold set to 0 and
still didn't get a segmentation fault or an increase in the regression
run time. So the change is safe.

--
Best Wishes,
Ashutosh Bapat



Hi,
In earlier benchmark runs we saw that some combinations of a lower number of joins and a lower number of partitions showed higher planning time with the patch than without it. Those coincided with higher deviation in the measurements. In an offline chat David Rowley suggested taking an average over many runs for those cases. Here are results with a smaller number of partitions, where the average is taken over thousands of runs. The planning time of a query was sampled as many times as fit in 30s and then averages were calculated. This exercise was repeated twice.

Each cell is a triple (s, md, pd), where s is the improvement in planning
time with the patches, in % relative to master (higher is better), md is
the standard deviation as % of the average planning time on master, and
pd is the standard deviation as % of the average planning time with the
patches.
Exercise 1
planning time improvement with PWJ=off
 num_parts |             2              |             3              |             4              |             5              
-----------+----------------------------+----------------------------+----------------------------+----------------------------
         0 | s=-0.96% md=7.63% pd=9.43% | s=-0.68% md=5.94% pd=5.65% | s=-0.29% md=5.15% pd=5.67% | s=0.19% md=6.25% pd=5.21%
        16 | s=-3.07% md=6.57% pd=3.56% | s=-1.52% md=1.29% pd=1.21% | s=-0.58% md=1.78% pd=1.47% | s=0.41% md=1.84% pd=1.51%
        32 | s=-1.40% md=2.58% pd=2.77% | s=-0.23% md=3.21% pd=1.16% | s=2.86% md=1.83% pd=1.78%  | s=4.16% md=1.23% pd=1.62%
        64 | s=-0.09% md=2.10% pd=2.11% | s=0.22% md=1.75% pd=4.23%  | s=5.86% md=1.87% pd=1.38%  | s=10.86% md=1.06% pd=0.84%
       128 | s=-1.15% md=2.79% pd=2.48% | s=3.45% md=1.67% pd=2.08%  | s=9.63% md=0.88% pd=1.98%  | s=17.33% md=1.46% pd=1.59%
       256 | s=-2.88% md=2.79% pd=2.53% | s=5.87% md=0.79% pd=1.80%  | s=13.63% md=1.07% pd=1.58% | s=23.57% md=0.78% pd=0.69%

planning time improvement with PWJ=on
 num_parts |             2              |             3              |             4              |             5              
-----------+----------------------------+----------------------------+----------------------------+----------------------------
         0 | s=-0.35% md=7.98% pd=7.32% | s=-0.84% md=6.40% pd=6.36% | s=-0.18% md=5.68% pd=5.62% | s=-1.13% md=4.64% pd=6.16%
        16 | s=-1.61% md=2.62% pd=2.26% | s=-0.42% md=1.27% pd=1.18% | s=-0.27% md=2.17% pd=2.23% | s=-0.86% md=1.46% pd=1.44%
        32 | s=-1.73% md=6.91% pd=6.56% | s=0.64% md=2.75% pd=2.12%  | s=1.03% md=2.23% pd=1.44%  | s=0.80% md=2.63% pd=1.71%
        64 | s=-1.12% md=4.13% pd=4.25% | s=1.81% md=5.22% pd=2.20%  | s=0.99% md=2.14% pd=2.20%  | s=1.64% md=1.24% pd=1.57%
       128 | s=-0.94% md=2.68% pd=3.27% | s=-2.82% md=4.00% pd=4.73% | s=2.43% md=2.28% pd=2.11%  | s=5.96% md=2.01% pd=1.23%
       256 | s=0.31% md=3.21% pd=2.58%  | s=3.74% md=3.48% pd=1.89%  | s=4.18% md=1.36% pd=1.36%  | s=5.49% md=0.52% pd=1.24%

Exercise 2
planning time improvement with PWJ=off
 num_parts |              2              |             3              |             4              |             5              
-----------+-----------------------------+----------------------------+----------------------------+----------------------------
         0 | s=1.26% md=16.39% pd=16.52% | s=-0.68% md=4.66% pd=5.15% | s=-1.62% md=4.38% pd=4.21% | s=-0.25% md=4.10% pd=4.29%
        16 | s=-4.40% md=8.67% pd=7.18%  | s=-1.08% md=1.44% pd=1.83% | s=1.12% md=0.75% pd=1.50%  | s=0.14% md=0.56% pd=0.46%
        32 | s=-1.17% md=5.86% pd=5.86%  | s=-0.39% md=1.22% pd=0.75% | s=3.33% md=0.40% pd=0.84%  | s=5.89% md=0.35% pd=0.65%
        64 | s=0.41% md=4.36% pd=4.77%   | s=0.93% md=0.29% pd=0.30%  | s=3.38% md=0.51% pd=0.28%  | s=9.63% md=0.19% pd=1.08%
       128 | s=-0.50% md=3.40% pd=3.06%  | s=1.54% md=0.31% pd=0.58%  | s=12.47% md=0.36% pd=0.62% | s=19.30% md=0.18% pd=0.24%
       256 | s=-3.42% md=2.93% pd=2.80%  | s=2.94% md=0.29% pd=0.36%  | s=12.50% md=0.22% pd=0.27% | s=23.14% md=0.30% pd=0.55%

planning time improvement with PWJ=on
 num_parts |             2              |             3              |             4              |             5              
-----------+----------------------------+----------------------------+----------------------------+----------------------------
         0 | s=1.47% md=6.05% pd=6.20%  | s=-0.72% md=5.14% pd=5.09% | s=0.53% md=4.17% pd=4.39%  | s=-0.13% md=4.03% pd=3.58%
        16 | s=-0.94% md=3.91% pd=3.51% | s=-1.46% md=2.24% pd=2.47% | s=1.12% md=0.67% pd=1.50%  | s=-0.64% md=0.76% pd=0.70%
        32 | s=-0.50% md=6.46% pd=6.11% | s=1.73% md=2.78% pd=1.30%  | s=3.47% md=0.77% pd=0.53%  | s=3.13% md=0.59% pd=0.31%
        64 | s=-1.19% md=3.52% pd=4.08% | s=-0.47% md=1.49% pd=0.48% | s=-1.04% md=0.67% pd=0.61% | s=5.87% md=2.50% pd=0.54%
       128 | s=-1.67% md=2.22% pd=2.45% | s=1.54% md=1.16% pd=1.21%  | s=4.26% md=0.43% pd=1.24%  | s=4.31% md=0.82% pd=0.51%
       256 | s=-2.10% md=1.66% pd=2.85% | s=0.96% md=1.04% pd=0.46%  | s=3.18% md=0.55% pd=0.53%  | s=7.41% md=1.08% pd=0.30%

The averages are now more stable than in the previous exercise. Some regressions seen in the first exercise are not seen in the second, some improvements seen in the first exercise are not seen in the second, and vice versa. Even the regressions present in both exercises may not be seen if I repeat the exercise a few more times. So I think the regressions or improvements seen with a lower number of joins and a lower number of partitions aren't real; they are mostly within noise level. However, the improvements seen with higher numbers of joins and partitions are always there, irrespective of the number of times I repeat the exercise. Please note that we have only tried partitions up to 256. Previous measurements have shown stable improvements with higher numbers of partitions.

--
Best Wishes,
Ashutosh Bapat
Hi,

On Wed, Mar 26, 2025 at 4:59 PM Ashutosh Bapat
<ashutosh.bapat.oss@gmail.com> wrote:
>
>
> The averages are now more stable than in the previous exercise. Some
> regressions seen in the first exercise are not seen in the second, some
> improvements seen in the first exercise are not seen in the second, and
> vice versa. Even the regressions present in both exercises may not be
> seen if I repeat the exercise a few more times. So I think the
> regressions or improvements seen with a lower number of joins and a
> lower number of partitions aren't real; they are mostly within noise
> level. However, the improvements seen with higher numbers of joins and
> partitions are always there, irrespective of the number of times I
> repeat the exercise. Please note that we have only tried partitions up
> to 256. Previous measurements have shown stable improvements with
> higher numbers of partitions.
>

Further, I experimented with the hash table size. The attached images
have four graphs of planning time and the planner's memory consumption,
measured for a 3-way join with initial hash table sizes of 64, 128 and
256:
1. Planning time vs number of partitions with PWJ enabled
2. planning time vs number of partitions with PWJ disabled
3. Memory consumed vs number of partitions with PWJ enabled
4. Memory consumed vs number of partitions with PWJ disabled

Also find attached the spreadsheet containing the measurements and
also the charts.

In the graphs, one can observe that the lines corresponding to all
three hash table sizes are practically inseparable. That leads to the
conclusion that neither planning time nor memory consumption changes
with the hash table size.

As a result I am keeping the initial hash table size at 256, the same
as in the previously attached patches.

--
Best Wishes,
Ashutosh Bapat

Attachment
On Thu, 27 Mar 2025 at 23:27, Ashutosh Bapat
<ashutosh.bapat.oss@gmail.com> wrote:
> Further, I experimented with hash table size. Attached images have
> four graphs for planning time and planner's memory consumption
> measured for a 3-way join for initial hash table sizes of 64, 128 and
> 256 respectively.

I put together a benchmarking script so I could learn the performance
of this patch. See attached.

It does not seem that surprising that you're not seeing much
difference in memory consumption. I believe your test case has a
single EquivalenceClass.  The hashtable bucket size is 40 bytes on my
machine, so going between 256*40 and 64*40 isn't much memory.  My
off-list mention of using 64 buckets as the initial size was because
you're switching to the hashing method at 32 items. If you made the
table 32, then it's guaranteed to need to be enlarged, so that's not
good. If you make it 64, then the worst-case fillfactor is 50% rather
than 12.5% with 256 elements.

Performing lookups on an appropriately sized hash table is going to
perform better than lookups on a sparse table. The reason for this is
that hash table probes rarely ever have a predictable memory access
pattern, and the larger the bucket array is, the more chance of having
a backend stall while fetching cache lines from some higher cache
level or RAM.  So, IMO, using 256, you're leaving performance on the
table and paying in RAM for the privilege.

You might not be too concerned about the memory because you've done
the tests, but testing with one EC and calling it good seems naive to
me.  I recall one query that Tom posted when I was working on the EC
index stuff for 3373c7155 that had over 1000 EquivalenceClasses. I
don't know how many of those would have had > 32 ec_derives entries,
but check [1] if you want to see.

I experimented by applying your v4 along with 0001-0003 of Yuya's v35
patchset from [2].  See the attached bz2 for my results run on an AMD
Zen2 machine. The CREATE TABLE statement is in the attached script.

If I run: select testname,parts,avg(joins),sum(plan_time) as
plan_time,avg(mem_used) mem_used,avg(mem_alloc) mem_alloc from
bench_results where testname not ilike '%pwj%' and testname ilike
'%yuya%' group by 1,2 order by parts,testname;
There are no results > 32 parts where 256 elements are faster than 64.
64 averages about 1% faster.  That could be noise, but slightly less
memory seems quite attractive to me when there's some evidence that
also comes with better performance.

Just to explain the names of the tests in the results:
v4_yuya_v35-0001-0003_list_free = Your v4 patch with Yuya's 0001-0003
with the fix for the pfree on the list.
v4_64buckets_yuya_v35-0001-0003_list_free is the same but with 64
bucket ec_derives_hash table. For the rest, the names should be fairly
self-explanatory.  If the name has "pwj" in it, then I had
partitionwise-joins enabled, if not, I had it disabled. master is
5d5f41581.

David

[1] https://www.postgresql.org/message-id/6970.1545327857%40sss.pgh.pa.us
[2] https://postgr.es/m/CAJ2pMkZ2soD_99UTGkvg4_fX=PAvd7oDNYUMOksqbEMzpdeJAA@mail.gmail.com

Attachment
Hi David,

On Fri, Mar 28, 2025 at 8:32 AM David Rowley <dgrowleyml@gmail.com> wrote:
>
> Performing lookups on an appropriately sized hash table is going to
> perform better than lookups on a sparse table. The reason for this is
> that hash table probes rarely ever have a predictable memory access
> pattern, and the larger the bucket array is, the more chance of having
> a backend stall while fetching cache lines from some higher cache
> level or RAM.  So, IMO, using 256, you're leaving performance on the
> table and paying in RAM for the privilege.

I don't know much about cache lines, but searching shows that they are
64 or 128 bytes long. The hash entry here is 40 bytes long, so only one
to three entries would fit in a cache line. My understanding could be
wrong, but it seems we will incur a cache miss for almost every entry
that we fetch, even if we fetch the entries in a tight loop. If we are
performing other operations in between, as will happen with the patch,
the cache lines are going to be evicted anyway. So it seems the size of
the hash table, or its sparseness, wouldn't affect the timing much.
Please correct me if I am wrong.

>
> You might not be too concerned about the memory because you've done
> the tests, but testing with one EC and calling it good seems naive to
> me.  I recall one query that Tom posted when I was working on the EC
> index stuff for 3373c7155 that had over 1000 EquivalenceClasses. I
> don't know how many of those would have had > 32 ec_derives entries,
> but check [1] if you want to see.

Comparing root->join_rel_hash with EC->ec_derives_hash in the context
of the initial hash table size was a thinko on my part. It's less
likely that there will be 1000 subqueries (requiring 1000 PlannerInfos),
each with more than 32 join rels, than a query with 1000 equivalence
classes in one PlannerInfo, each with more than 32 ec_derives. So if
using a small initial hash table doesn't impact performance negatively,
why not save some memory. Thinking about it more: we know the length of
ec_derives_list when creating the hash table, and we are using
simplehash, which applies its own fillfactor and its own logic to expand
the hash table, so I think we should just use the length of
ec_derives_list as the initial size. What do you think?
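
For concreteness, a sketch of what that would look like at the table
creation site (the "derives_" simplehash prefix and the memory context
used here are assumptions on my part, just to illustrate the sizing):

    /* Sketch: size the hash table from the list we are about to insert. */
    ec->ec_derives_hash = derives_create(CurrentMemoryContext,
                                         list_length(ec->ec_derives_list),
                                         NULL);

simplehash would then add its own headroom per its fillfactor, so the
table never starts out needlessly sparse.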


--
Best Wishes,
Ashutosh Bapat



On Fri, Mar 28, 2025 at 12:01 PM Ashutosh Bapat
<ashutosh.bapat.oss@gmail.com> wrote:
>
> Comparing root->join_rel_hash with EC->ec_derives_hash in the context
> of initial hash table size is a thinko on my part. It's less likely
> that there will be 1000 subqueries (requiring 1000 PlannerInfos) each
> with more than 32 join rels than a query with 1000 equivalence classes
> in one PlannerInfo with more than 32 ec_derives. So if using a small
> initial hash table doesn't impact performance negatively, why not save
> some memory. Thinking more about it, we know the size of
> ec_derives_list when creating the hash table and we are using
> simplehash which uses its own fillfactor and its own logic to expand
> the hash table, I think we should just use the length of
> ec_derives_list as the initial size. What do you think?
>

PFA patches. 0001 and 0002 are the same as the previous set. 0003
changes the initial hash table size to the length of ec_derives.

--
Best Wishes,
Ashutosh Bapat

Attachment
On 2025-Mar-28, David Rowley wrote:

> I experimented by applying your v4 along with 0001-0003 of Yuya's v35
> patchset from [2].  See the attached bz2 for my results run on an AMD
> Zen2 machine. The CREATE TABLE statement is in the attached script.

Eyeballing these results, unless I am misreading them, the patch brings
zero benefit.

With PWJ off, there's no plan time improvement and no memory consumption
improvement either; only with Watari-san's patch there's an improvement
in plan time (and with Watari's patch, I see negligible difference
between the case with 64 buckets and the other case, which I assume has
256 buckets).  But the "master" vs. "v4_patch" cases seem to be
essentially identical.

With PWJ on, the situation is the same.  master_pwj looks pretty much
the same as v4_pwj.

-- 
Álvaro Herrera        Breisgau, Deutschland  —  https://www.EnterpriseDB.com/



On Fri, Mar 28, 2025 at 10:46 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
>
> On 2025-Mar-28, David Rowley wrote:
>
> > I experimented by applying your v4 along with 0001-0003 of Yuya's v35
> > patchset from [2].  See the attached bz2 for my results run on an AMD
> > Zen2 machine. The CREATE TABLE statement is in the attached script.
>
> Eyeballing these results, unless I am misreading them, the patch brings
> zero benefit.
>
> With PWJ off, there's no plan time improvement and no memory consumption
> improvement either; only with Watari-san's patch there's an improvement
> in plan time (and with Watari's patch, I see negligible difference
> between the case with 64 buckets and the other case, which I assume has
> 256 buckets).  But the "master" vs. "v4_patch" cases seem to be
> essentially identical.
>
> With PWJ on, the situation is the same.  master_pwj looks pretty much
> the same as v4_pwj.

My patch optimizes create_join_clause(), which is invoked when planning
index scans. David's script does not create any indexes - implicit or
explicit. My script, for example, has the partition key as the primary
key. So this is not surprising. If the script is modified to create the
partitioned table with a primary key, it gives results similar to mine.
That also explains why he doesn't see any change in memory consumption.

EC members are created irrespective of whether there are indexes or
not, and Yuya's first three patches are related to ec_members handling.
So those show some improvement with David's script.

--
Best Wishes,
Ashutosh Bapat



On Sat, 29 Mar 2025 at 05:46, Ashutosh Bapat
<ashutosh.bapat.oss@gmail.com> wrote:
> PFA patches. 0001 and 0002 are the same as the previous set. 0003
> changes the initial hash table size to the length of ec_derives.

I'm just not following the logic in making it the length of the
ec_derives List. If you have 32 buckets and try to insert 32 elements,
you're guaranteed to need a resize after inserting 28 elements. See
the grow_threshold logic in SH_UPDATE_PARAMETERS(). The point of making
it 64 was to ensure the table is never unnecessarily sparse and to
also ensure we make it at least big enough for the minimum number of
ec_derives that we're about to insert.

Looking more closely at the patch's ec_add_clause_to_derives_hash()
function, I see you're actually making two hash table entries for each
RestrictInfo, so without any em_is_const members, you'll insert 64
entries into the hash table with an ec_derives list of 32, in which
case 64 buckets isn't enough and the table will end up growing to 128
elements.

I think you'd be better off coming up with some logic like putting the
lowest pointer value'd EM first in the key and ensure that all lookups
do that too by wrapping the lookups in some helper function. That'll
halve the number of hash table entries at the cost of some very cheap
comparisons.
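
Something like this sketch, for illustration (the key struct and field
names here are my own placeholders, not necessarily what the patch
uses):

/*
 * Order the two EquivalenceMembers by pointer value so that lookups
 * for (A, B) and (B, A) build the same key and hit the same entry.
 */
static inline void
fill_key_canonical(ECDerivesKey *key,
                   EquivalenceMember *em_a,
                   EquivalenceMember *em_b,
                   EquivalenceClass *parent_ec)
{
    if (em_a < em_b)
    {
        key->em1 = em_a;
        key->em2 = em_b;
    }
    else
    {
        key->em1 = em_b;
        key->em2 = em_a;
    }
    key->parent_ec = parent_ec;
}

Both the insert path and the lookup path would go through the one
helper, so the canonical ordering can never get out of sync.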

David



On Mon, Mar 31, 2025 at 8:27 AM David Rowley <dgrowleyml@gmail.com> wrote:
>
> On Sat, 29 Mar 2025 at 05:46, Ashutosh Bapat
> <ashutosh.bapat.oss@gmail.com> wrote:
> > PFA patches. 0001 and 0002 are the same as the previous set. 0003
> > changes the initial hash table size to the length of ec_derives.
>
> I'm just not following the logic in making it the length of the
> ec_derives List. If you have 32 buckets and try to insert 32 elements,
> you're guaranteed to need a resize after inserting 28 elements. See
> the grow_threshold logic SH_UPDATE_PARAMETERS(). The point of making
> it 64 was to ensure the table is never unnecessarily sparse and to
> also ensure we make it at least big enough for the minimum number of
> ec_derives that we're about to insert.

If I am reading SH_CREATE correctly, it will initially set the size to
32/.9 = 36 and in SH_UPDATE_PARAMETERS will set grow_threshold to 36 *
.9 = 32. So it will leave some room even after inserting the initial
elements. But looking at SH_INSERT_HASH_INTERNAL(), it will soon
expand the table even if there's space. We certainly need a size much
larger than 32. 32 is an arbitrary/empirical threshold for creating the
hash table; using that threshold as the initial hash table size means
the table size would be arbitrary too. Using twice the length of
ec_derives_list seems more reasonable, since the length determines the
initial number of entries.

>
> Looking more closely at the patch's ec_add_clause_to_derives_hash()
> function, I see you're actually making two hash table entries for each
> RestrictInfo, so without any em_is_const members, you'll insert 64
> entries into the hash table with a ec_derives list of 32, in which
> case 64 buckets isn't enough and the table will end up growing to 128
> elements.
>

Yes, that's right.

> I think you'd be better off coming up with some logic like putting the
> lowest pointer value'd EM first in the key and ensure that all lookups
> do that too by wrapping the lookups in some helper function. That'll
> half the number of hash table entries at the cost of some very cheap
> comparisons.

That's a good suggestion. I searched for C standard documentation
specifying that pointer comparison, especially ordering comparison, is
stable and safe to use, but I didn't find any. According to the C
standard, the result of comparing pointers within the same array or
struct is specified, but comparing pointers to two different objects is
unspecified. The existing code relies on the EM pointers being stable
and also relies on equality between them being stable. It has withstood
the test of time and a variety of compilers. Hence I think it should be
safe to rely on pointer comparisons being stable. But since I didn't
find any documentation which confirms it, I have left those changes as
a separate patch. Some internet sources discussing pointer comparison
can be found at [1] and [2] (which mentions the C standard but doesn't
provide a link).

PFA the next patchset

0001, 0002 are the same as in the previous set.
0003 changes the initial hash table size to
length(ec->ec_derives_list) * 4 to accommodate the commuted entries.
0004 uses canonical keys per your suggestion and reduces the
initial hash table size to length(ec->ec_derives_list) * 2.

When the number of initial elements to be added to the hash table is
known, I see users of simplehash.h passing that as an estimate rather
than rounding it to a power of two themselves. Hence I am following
that logic here.

[1]
https://stackoverflow.com/questions/59516346/how-does-pointer-comparison-work-in-c-is-it-ok-to-compare-pointers-that-dont-p
[2] https://www.gnu.org/software/c-intro-and-ref/manual/html_node/Pointer-Comparison.html

--
Best Wishes,
Ashutosh Bapat

Attachment
On Mon, Mar 31, 2025 at 7:00 PM Ashutosh Bapat
<ashutosh.bapat.oss@gmail.com> wrote:
> On Mon, Mar 31, 2025 at 8:27 AM David Rowley <dgrowleyml@gmail.com> wrote:
> >
> > On Sat, 29 Mar 2025 at 05:46, Ashutosh Bapat
> > <ashutosh.bapat.oss@gmail.com> wrote:
> > > PFA patches. 0001 and 0002 are the same as the previous set. 0003
> > > changes the initial hash table size to the length of ec_derives.
> >
> > I'm just not following the logic in making it the length of the
> > ec_derives List. If you have 32 buckets and try to insert 32 elements,
> > you're guaranteed to need a resize after inserting 28 elements. See
> > the grow_threshold logic SH_UPDATE_PARAMETERS(). The point of making
> > it 64 was to ensure the table is never unnecessarily sparse and to
> > also ensure we make it at least big enough for the minimum number of
> > ec_derives that we're about to insert.
>
> If I am reading SH_CREATE correctly, it will initially set the size to
> 32/.9 = 36 and in SH_UPDATE_PARAMETERS will set grow_threshold to 36 *
> .9 = 32. So it will leave some room even after inserting the initial
> elements. But looking at SH_INSERT_HASH_INTERNAL(), it will soon
> expand the table even if there's space. We certainly need a size much
> more than 32. 32 is an arbitrary/empirical threshold to create a hash
> table. Using that threshold as initial hash table size means the table
> size would be arbitrary too. Using twice the  length of
> ec_derives_list seems more reasonable since the length will decide the
> initial number of entries.

I think David’s suggestion to use 64 as the fixed initial size is
simpler and more predictable. Since list_length * 2 will always be >=
64 anyway, unless we expect clause counts to grow significantly right
after the threshold, the fixed size avoids the need to reason about
sizing heuristics and keeps the logic clearer.

It also lines up well with David’s point -- 64 strikes a good balance
for both memory usage and CPU efficiency. Unless there’s a strong
reason to favor dynamic sizing, I’d prefer to stick with the fixed 64.

> > I think you'd be better off coming up with some logic like putting the
> > lowest pointer value'd EM first in the key and ensure that all lookups
> > do that too by wrapping the lookups in some helper function. That'll
> > half the number of hash table entries at the cost of some very cheap
> > comparisons.
>
> That's a good suggestion. I searched for C standard documentation
> which specifies that the pointer comparison, especially inequality, is
> stable and safe to use. But I didn't find any. While according to the
> C standard, the result
> of comparison between pointers within the same array or a struct is
> specified, that between pointers from two different objects is
> unspecified. The existing code relies on the EM pointers being stable
> and also relies on equality between
> them to be stable. It has withstood the test of time and a variety of
> compilers. Hence I think it should be safe to rely on pointer
> comparisons being stable. But since I didn't find any documentation
> which confirms it, I have left those changes as a separate patch. Some
> internet sources discussing pointer comparison can be found at [1],
> [2] (which mentions the C standard but doesn't provide a link).
>
> PFA the next patchset
>
> 0001, 0002 are same as in the previous set
> 0003 changes the initial hash table size to
> length(ec->ec_derives_list) * 4 to accomodate for commuted entries.
> 0004 uses canonical keys per your suggestion and also reduces the
> initial hash table size to  length(ec->ec_derives_list) * 2.

Thanks for the new patches.

As David suggested off-list, it seems better to have a static inline
function for the key canonicalization logic than to duplicate it in
the insert and lookup paths, as done in your 0004.

Also, using macros for the ec_derives_hash threshold and initial size
seems cleaner than hardcoding the values.

I’ve combined your 0003 and 0004 into the attached 0003, which:

* Adds the canonicalization logic via fill_ec_derives_key().

* Replaces literal values for threshold and initial size with macros,
defined near the key and entry types.

* Updates code comments to reflect the new design.

I wasn’t sure why you removed this comment:

- * We do not attempt a second lookup with EMs swapped when using the hash
- * table; such clauses are inserted under both orderings at the time of
- * insertion.

We still avoid the second lookup, but the reason has changed -- we now
canonicalize keys during both insertion and lookup, which makes the
second probe unnecessary. So I rewrote it as:

+ * We do not attempt a second lookup with EMs swapped, because the key is
+ * canonicalized during both insertion and lookup.

0003’s commit message includes a blurb I plan to paste into the main
patch’s message (with minor tweaks) when we squash the patches.

--
Thanks, Amit Langote

Attachment
On Tue, Apr 1, 2025 at 1:32 PM Amit Langote <amitlangote09@gmail.com> wrote:
>
> I think David’s suggestion to use 64 as the fixed initial size is
> simpler and more predictable. Since list_length * 2 will always be >=
> 64 anyway, unless we expect clause counts to grow significantly right
> after the threshold, the fixed size avoids the need to reason about
> sizing heuristics and keeps the logic clearer.
>
> It also lines up well with David’s point -- 64 strikes a good balance
> for both memory usage and CPU efficiency. Unless there’s a strong
> reason to favor dynamic sizing, I’d prefer to stick with the fixed 64.

Consider the following table and query (based on a similar table in
join.sql), which exercise the find_derived_clause_for_ec_member() code
path.
#create table fkest (x integer, x10 integer, x10b integer, x100
integer, unique(x, x10, x100), foreign key (x, x10b, x100) references
fkest(x, x10, x100));
#select 'select * from fkest f1 join ' || string_agg(format('fkest f%s
on (f1.x = f%s.x and f1.x10 = f%s.x10b and f1.x100 = f%s.x100)', i, i,
i , i), ' join ') || ' where f1.x100 = 2' query from
generate_series(2, 100) i; \gset
#explain (summary) :query;

This is a 100-way self-join between a foreign key and the referenced
key, with one of the foreign key columns set to a constant. This
exercises the case of derived clauses with a constant EM.

When planning this query, all the 100 derived clauses containing the
constant EM are created first and then they are searched one by one
for every EM. Thus when the hash table is created in
find_derived_clause_for_ec_member()->ec_search_derived_clause_for_ems(),
the length of ec_derives_list is already 100, so the hash table will
be expanded while it's being filled for the first time, if we use a
constant 64 as the initial hash table size. This can be avoided if we
use list_length() * 2. The pattern for create_join_clause(), however, is
search then insert - so it will always create the hash table with
initial size 64 right when the list length is 32. Both these cases are
served well if we base the initial hash table size as a multiple of
list_length(ec_derives_list).

For the record, without the patch this query takes about 5800ms on
average for planning on my laptop. With the patch the planning time
reduces to about 5400ms - a 6-7% improvement, if my approximate math
is correct.

>
> As David suggested off-list, it seems better to have a static inline
> function for the key canonicalization logic than to duplicate it in
> the insert and lookup paths, as done in your 0004.

A static inline function to fill the key in canonical form is a good
idea. Thanks for the patch.

>
> * Replaces literal values for threshold and initial size with macros,
> defined near the key and entry types.

I don't see this in your attached patches. Am I missing something?
It's still using list_length() for initial hash table size. But with
my explanation above, I believe we will keep it that way.

>
> I wasn’t sure why you removed this comment:
>
> - * We do not attempt a second lookup with EMs swapped when using the hash
> - * table; such clauses are inserted under both orderings at the time of
> - * insertion.

When I read it, I didn't understand why we mentioned a second lookup,
so I dropped it. But now I see that the comment is in the context of
two comparisons being done when using list. I have rephrased the
comment a bit to make this comparison clear.


>
> 0003’s commit message includes a blurb I plan to paste into the main
> patch’s message (with minor tweaks) when we squash the patches.

Slight correction in the first sentence of that blurb message.

Derived clauses are now stored in the ec_derives_hash using canonicalized
keys: the EquivalenceMember with lower memory address is always placed
in em1, and ...

+/*
+ * fill_ec_derives_key
+ *     Compute a canonical key for ec_derives_hash lookup or insertion.
+ *
+ * Derived clauses are looked up using a pair of EquivalenceMembers and a
+ * parent EquivalenceClass. To avoid storing or searching for both EM
+ * orderings, we canonicalize the key:
+ *
+ * - For clauses involving two non-constant EMs, we order the EMs by address
+ *   and place the lower one first.
+ * - For clauses involving a constant EM, the caller must pass the non-constant
+ *   EM as leftem; we then set em1 = NULL and em2 = leftem.
+ */
+static inline void
+fill_ec_derives_key(ECDerivesKey *key,
+                    EquivalenceMember *leftem,
+                    EquivalenceMember *rightem,
+                    EquivalenceClass *parent_ec)
+{
+    Assert(leftem);     /* Always required for lookup or insertion */
+
+    /*
+     * Clauses with a constant EM are always stored and looked up using only
+     * the non-constant EM, with the other key slot set to NULL.
+     */

This comment seems to overlap with what's already in the prologue.
However, "Clauses with a constant EM are always stored and looked up
using the non-constant EM" is the non-overlapping part. That part is
also repeated in ec_add_clause_to_derives_hash(), which I think is a
better place for this comment. Changed in the attached patch.

  */
     Assert(!rinfo->left_em->em_is_const);
+    /*
+     * Clauses containing a constant are never considered redundant, so
+     * parent_ec is not set.
+     */
+    Assert(!rinfo->parent_ec || !rinfo->right_em->em_is_const);

These two assertions apply to all the derived clauses, but they are
added to a function which deals only with the hash table. If an EC
doesn't have enough derived clauses to create a hash table, these
assertions won't be checked. The reason they are here is that we use
these properties to build the canonical hash table key. Should we move
them to ec_add_derived_clause() for better coverage?

I have also removed some comments that were repeated between the
function prologue and the function body.

0001, 0002 and 0003 are the same as your set. 0004 contains my changes
in a separate patch so that it's easy to review those.

--
Best Wishes,
Ashutosh Bapat

Attachment
On Wed, Apr 2, 2025 at 1:58 AM Ashutosh Bapat
<ashutosh.bapat.oss@gmail.com> wrote:
> On Tue, Apr 1, 2025 at 1:32 PM Amit Langote <amitlangote09@gmail.com> wrote:
> >
> > I think David’s suggestion to use 64 as the fixed initial size is
> > simpler and more predictable. Since list_length * 2 will always be >=
> > 64 anyway, unless we expect clause counts to grow significantly right
> > after the threshold, the fixed size avoids the need to reason about
> > sizing heuristics and keeps the logic clearer.
> >
> > It also lines up well with David’s point -- 64 strikes a good balance
> > for both memory usage and CPU efficiency. Unless there’s a strong
> > reason to favor dynamic sizing, I’d prefer to stick with the fixed 64.
>
> Consider following table and query (based on a similar table in
> join.sql, which exercises find_derived_clause_for_ec_member() code
> path.
> #create table fkest (x integer, x10 integer, x10b integer, x100
> integer, unique(x, x10, x100), foreign key (x, x10b, x100) references
> fkest(x, x10, x100));
> #select 'select * from fkest f1 join ' || string_agg(format('fkest f%s
> on (f1.x = f%s.x and f1.x10 = f%s.x10b and f1.x100 = f%s.x100)', i, i,
> i , i), ' join ') || ' where f1.x100 = 2' query from
> generate_series(2, 100) i; \gset
> #explain (summary) :query;
>
> This is a 100-way self-join between foreign key and referenced key and
> one of the foreign keys being set to a constant. This exercises the
> case of derived clauses with constant EM.
>
> When planning this query, all the 100 derived clauses containing the
> constant EM are created first and then they are searched one by one
> for every EM. Thus when the hash table is created in
> find_derived_clause_for_ec_member()->ec_search_derived_clause_for_ems(),
> the length of ec_derives_list is already 100, so the hash table will
> be expanded while it's being filled up the first time, if we use
> constant 64 as initial hash table size. This can be avoided if we use
> list_length() * 2. The pattern for create_join_clause() however is
> search then insert - so it will always create the hash table with
> initial size 64 right when the list length is 32. Both these cases are
> served well if we base the initial hash table size as a multiple of
> list_length(ec_derives_list).
>
> For the record, without the patch this query takes about 5800ms on
> average for planning on my laptop. With the patch the planning time
> reduces to about 5400ms - 6-7% of improvement if my approximate math
> is correct.

Yeah, I figured the constant EM case might end up adding a bunch of
derived clauses before the hash table is built, but this example
really helps illustrate that, thanks.

It does seem like a special case --
generate_base_implied_equalities_const() dumps in a whole bunch of
clauses up front, unlike the usual search-then-insert pattern in
create_join_clause().

After our chat with David, I think using list_length(ec_derives_list)
to size the hash table is reasonable. Given how simplehash rounds up
-- dividing by the fillfactor and rounding to the next power of two --
we still end up with 64 buckets around the threshold. So the dynamic
sizing behaves pretty predictably and doesn't seem like a bad choice
after all.
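
To spell that arithmetic out, here's a tiny standalone sketch (plain C,
not PostgreSQL code; it assumes simplehash's behaviour of dividing the
requested element count by its 0.9 fillfactor and rounding up to the
next power of two):

#include <stdint.h>
#include <stdio.h>

/* Round up to the next power of two, as pg_nextpower2_64() would. */
static uint64_t
next_pow2(uint64_t v)
{
    uint64_t p = 1;

    while (p < v)
        p <<= 1;
    return p;
}

int
main(void)
{
    /* e.g. list_length(ec_derives_list) right at the threshold */
    uint64_t    nelements = 32;
    uint64_t    buckets = next_pow2((uint64_t) ((double) nelements / 0.9));
    uint64_t    grow_threshold = (uint64_t) (buckets * 0.9);

    /* prints: buckets=64 grow_threshold=57 */
    printf("buckets=%llu grow_threshold=%llu\n",
           (unsigned long long) buckets,
           (unsigned long long) grow_threshold);
    return 0;
}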

> > * Replaces literal values for threshold and initial size with macros,
> > defined near the key and entry types.
>
> I don't see this in your attached patches. Am I missing something?
> It's still using list_length() for initial hash table size. But with
> my explanation above, I believe we will keep it that way.

Oops, looks like I created those (v6) patches before adding those macro changes.

> > I wasn’t sure why you removed this comment:
> >
> > - * We do not attempt a second lookup with EMs swapped when using the hash
> > - * table; such clauses are inserted under both orderings at the time of
> > - * insertion.
>
> When I read it, I didn't understand why we mentioned a second lookup,
> so I dropped it. But now I see that the comment is in the context of
> two comparisons being done when using list. I have rephrased the
> comment a bit to make this comparison clear.

Looks good.

> > 0003’s commit message includes a blurb I plan to paste into the main
> > patch’s message (with minor tweaks) when we squash the patches.
>
> Slight correction in the first sentence of that blurb message.
>
> Derived clauses are now stored in the ec_derives_hash using canonicalized
> keys: the EquivalenceMember with lower memory address is always placed
> in em1, and ...

Looks good.

> +/*
> + * fill_ec_derives_key
> + * Compute a canonical key for ec_derives_hash lookup or insertion.
> + *
> + * Derived clauses are looked up using a pair of EquivalenceMembers and a
> + * parent EquivalenceClass. To avoid storing or searching for both EM
> orderings,
> + * we canonicalize the key:
> + *
> + * - For clauses involving two non-constant EMs, we order the EMs by address
> + * and place the lower one first.
> + * - For clauses involving a constant EM, the caller must pass the non-constant
> + * EM as leftem; we then set em1 = NULL and em2 = leftem.
> + */
> +static inline void
> +fill_ec_derives_key(ECDerivesKey *key,
> + EquivalenceMember *leftem,
> + EquivalenceMember *rightem,
> + EquivalenceClass *parent_ec)
> +{
> + Assert(leftem); /* Always required for lookup or insertion */
> +
> + /*
> + * Clauses with a constant EM are always stored and looked up using only
> + * the non-constant EM, with the other key slot set to NULL.
> + */
>
> This comment seems to overlap with what's already there in the
> prologue. However "Clauses with a constant EM are always stored and
> looked up using the non-constant EM" is non-overlapping part. This
> part is also repeated in ec_add_clause_to_derives_hash(), which I
> think is a better place for this comment. Changed in the attached
> patch.

Ok, those comment changes look good overall.

> */
> Assert(!rinfo->left_em->em_is_const);
> + /*
> + * Clauses containing a constant are never considered redundant, so
> + * parent_ec is not set.
> + */
> + Assert(!rinfo->parent_ec || !rinfo->right_em->em_is_const);
>
> These two Assertions are applicable to all the derived clauses but are
> added to a function which deals with only hash table. If an EC doesn't
> have enough derived clauses to create a hash table these assertions
> won't be checked. The reason they are here is because we are using
> these properties to create canonical hash table key. Should we move
> them to ec_add_derived_clause() for better coverage?

Yes, good idea.

> I have also removed some comments repeated in the function prologue
> and function body.
>
> 0001, 0002 and 0003 are the same as your set. 0004 contains my changes
> in a separate patch so that it's easy to review those.

Here's v7 in which I have merged 0003 and 0004 into 0002.

You'll see that threshold now uses a macro and
list_length(ec->ec_derives_list) is passed as initial size.

I'm feeling good about this version, but let me know if you have any
further thoughts / comments.

--
Thanks, Amit Langote

Attachment
On Wed, Apr 2, 2025 at 1:00 PM Amit Langote <amitlangote09@gmail.com> wrote:

>
> I'm feeling good about this version, but let me know if you have any
> further thoughts / comments.

Thanks for incorporating the changes and fixing initial hash table size.

+ #define EC_DERIVES_HASH_THRESHOLD 32

Given that the constant is being used only at a single place, we don't
need a macro. But I am not against the macro.

PFA patch set with some minor edits in 0003. Also I have edited commit
message of 0001 and 0002.

In the commit message of 0002,
1. mentioning that the lookup happens only for join clause generation
is not accurate, since we look up EM = constant clauses as well, which
are not join clauses.
2. In the second paragraph, em1 and em2 are mentioned without saying
what they are. I have rephrased it so as to avoid mentioning the
structure member names.

--
Best Wishes,
Ashutosh Bapat

Attachment
On Wed, Apr 2, 2025 at 9:52 PM Ashutosh Bapat
<ashutosh.bapat.oss@gmail.com> wrote:
> On Wed, Apr 2, 2025 at 1:00 PM Amit Langote <amitlangote09@gmail.com> wrote:
> > I'm feeling good about this version, but let me know if you have any
> > further thoughts / comments.
>
> Thanks for incorporating the changes and fixing initial hash table size.
>
> + #define EC_DERIVES_HASH_THRESHOLD 32
>
> Given that the constant is being used only at a single place, we don't
> need a macro. But I am not against the macro.

Yeah, let's keep it, because it documents well.

> PFA patch set with some minor edits in 0003. Also I have edited commit
> message of 0001 and 0002.
>
> In the commit messages of 0002,
> 1. mentioning that the lookup happens only for join clause generation
> is not accurate, since we lookup EM = constant clauses as well which
> are not join clauses.
> 2. In the second paragraph em1 and em2 are mentioned without
> mentioning what are they. I have rephrased it so as to avoid
> mentioning names of structure member.

Incorporated, thanks.

I'll plan to commit these tomorrow barring objections.

--
Thanks, Amit Langote

Attachment
On Thu, Apr 3, 2025 at 12:28 PM Amit Langote <amitlangote09@gmail.com> wrote:
> On Wed, Apr 2, 2025 at 9:52 PM Ashutosh Bapat
> <ashutosh.bapat.oss@gmail.com> wrote:
> > On Wed, Apr 2, 2025 at 1:00 PM Amit Langote <amitlangote09@gmail.com> wrote:
> > > I'm feeling good about this version, but let me know if you have any
> > > further thoughts / comments.
> >
> > Thanks for incorporating the changes and fixing initial hash table size.
> >
> > + #define EC_DERIVES_HASH_THRESHOLD 32
> >
> > Given that the constant is being used only at a single place, we don't
> > need a macro. But I am not against the macro.
>
> Yeah, let's keep it, because it documents well.
>
> > PFA patch set with some minor edits in 0003. Also I have edited commit
> > message of 0001 and 0002.
> >
> > In the commit messages of 0002,
> > 1. mentioning that the lookup happens only for join clause generation
> > is not accurate, since we lookup EM = constant clauses as well which
> > are not join clauses.
> > 2. In the second paragraph em1 and em2 are mentioned without
> > mentioning what are they. I have rephrased it so as to avoid
> > mentioning names of structure member.
>
> Incorporated, thanks.
>
> I'll plan to commit these tomorrow barring objections.

I’ve now marked this as committed after pushing the patches earlier today.

I realize the CF entry was originally about the project to reduce
memory usage during partitionwise join planning, but we ended up
committing something else. I suppose we can create a new entry if and
when we pick that original work back up.

--
Thanks, Amit Langote



On Fri, Apr 4, 2025 at 2:04 PM Amit Langote <amitlangote09@gmail.com> wrote:
>
> On Thu, Apr 3, 2025 at 12:28 PM Amit Langote <amitlangote09@gmail.com> wrote:
> > On Wed, Apr 2, 2025 at 9:52 PM Ashutosh Bapat
> > <ashutosh.bapat.oss@gmail.com> wrote:
> > > On Wed, Apr 2, 2025 at 1:00 PM Amit Langote <amitlangote09@gmail.com> wrote:
> > > > I'm feeling good about this version, but let me know if you have any
> > > > further thoughts / comments.
> > >
> > > Thanks for incorporating the changes and fixing initial hash table size.
> > >
> > > + #define EC_DERIVES_HASH_THRESHOLD 32
> > >
> > > Given that the constant is being used only at a single place, we don't
> > > need a macro. But I am not against the macro.
> >
> > Yeah, let's keep it, because it documents well.
> >
> > > PFA patch set with some minor edits in 0003. Also I have edited commit
> > > message of 0001 and 0002.
> > >
> > > In the commit messages of 0002,
> > > 1. mentioning that the lookup happens only for join clause generation
> > > is not accurate, since we lookup EM = constant clauses as well which
> > > are not join clauses.
> > > 2. In the second paragraph em1 and em2 are mentioned without
> > > mentioning what are they. I have rephrased it so as to avoid
> > > mentioning names of structure member.
> >
> > Incorporated, thanks.
> >
> > I'll plan to commit these tomorrow barring objections.
>
> I’ve now marked this as committed after pushing the patches earlier today.

Thanks a lot.

>
> I realize the CF entry was originally about the project to reduce
> memory usage during partitionwise join planning, but we ended up
> committing something else. I suppose we can create a new entry if and
> when we pick that original work back up.

Will create a new CF entry just to keep the patch floated.

--
Best Wishes,
Ashutosh Bapat



On Fri, Apr 4, 2025 at 5:48 PM Ashutosh Bapat
<ashutosh.bapat.oss@gmail.com> wrote:
> On Fri, Apr 4, 2025 at 2:04 PM Amit Langote <amitlangote09@gmail.com> wrote:
> > I’ve now marked this as committed after pushing the patches earlier today.
>
> Thanks a lot.

Thank you too for working on it.

> > I realize the CF entry was originally about the project to reduce
> > memory usage during partitionwise join planning, but we ended up
> > committing something else. I suppose we can create a new entry if and
> > when we pick that original work back up.
>
> Will create a new CF entry just to keep the patch floated.

Sounds good.

Saving memory where we can does seem worthwhile, as long as the
approach stays simple. If there’s any doubt on that front, maybe it’s
worth spending a bit more time to see if things can be simplified.

--
Thanks, Amit Langote