Matthias van de Meent <boekewurm+postgres@gmail.com> writes:
> On Thu, 21 Nov 2024 at 13:03, Toto guyoyg <thomas.bessou@hotmail.fr> wrote:
>> It looks like this could be improved/fixed by either/all of:
>>
>> 1. Using a hashset (or sort + binary search) for recheck (past a certain array length or even always) instead of
alwayssearching linearly
> IIUC, hashed "= ANY()" expressions were already implemented with
> commit 50e17ad2 (released in PG14) - look for
> EEOP_HASHED_SCALARARRAYOP in expression handling code.
I checked into that and verified that this test case isn't reaching
EEOP_HASHED_SCALARARRAYOP, because that commit only addressed the
scenario where the array argument of =ANY is a constant. (A
post-const-folding constant, but still a constant.)
To apply that optimization here, where the array is an output of a
subplan, we'd need some mechanism whereby ExecEvalHashedScalarArrayOp
could find out when the array has changed underneath it. That's
probably possible (by somehow extending the chgParam signaling logic),
but the details aren't obvious. The planner's decision about when
to apply hashing would become much less obvious, too.
>> 2. Fixing planner assuming that all arrays are of size 10, using instead actual or estimated sizes.
> IIUC, this was also already implemented with commit 9391f715 (released in PG17).
That's not helpful here either, since the arrays are built on-the-fly
rather than being stored in table columns (where stats could be
collected on them). You could imagine installing bespoke logic for
array_agg() that looks at the estimated rowcount for the sub-select,
perhaps.
The bottom line here is that improving these queries would take a
significant amount of work, and they just aren't very compelling
examples that seem to justify such effort. Folding the output of
a subquery into an array is largely an anti-pattern in SQL in the
first place.
regards, tom lane